https://arxiv.org/api/hnbKLK7G2blJxa2PW2H9E09ngy4
2026-06-14T00:14:47Z
606552515

http://arxiv.org/abs/2511.01454v1
"Don't Teach Minerva": Guiding LLMs Through Complex Syntax for Faithful Latin Translation with RAG
Updated: 2025-11-03T11:11:27Z
Translating a morphology-rich, low-resource language like Latin poses significant challenges. This paper introduces a reproducible draft-based refinement pipeline that elevates open-source Large Language Models (LLMs) to a performance level statistically comparable to top-tier proprietary systems. Our method first uses a fine-tuned NLLB-1.3B model to generate a high-quality, structurally faithful draft. A zero-shot LLM (Llama-3.3 or Qwen3) then polishes this draft, a process that can be further enhanced by augmenting the context with retrieved out-context examples (RAG). We demonstrate the robustness of this approach on two distinct benchmarks: a standard in-domain test set (Rosenthal, 2023) and a new, challenging out-of-domain (OOD) set of 12th-century Latin letters (2025). Our central finding is that this open-source RAG system achieves performance statistically comparable to the GPT-5 baseline, without any task-specific LLM fine-tuning. We release the pipeline, the Chartres OOD set, and evaluation scripts and models to facilitate replicability and further research.
Published: 2025-11-03T11:11:27Z
Authors: Sergio Torres Aguilar

http://arxiv.org/abs/2511.01439v1
Reforming research funding: Combining editorial preregistration with grant peer review
Updated: 2025-11-03T10:47:36Z
Competitive grant funding is associated with high costs and a potential bias to favor conservative research. This comment proposes integrating editorial preregistration, in the form of registered reports, into grant peer review processes as a reform strategy. Linking funding decisions to in-principle accepted study protocols would reduce reviewer burden, strengthen methodological rigor, and provide an institutional foundation for (more) replication, theory-driven research, and high-risk research.
Our proposal also minimizes strategic proposal writing and ensures scholarly output through the publication of preregistered protocols, regardless of funding outcomes. Possible implementation models include direct coupling of journal acceptance with funding, co-review mechanisms, voucher systems, and lotteries. While challenges remain in aligning journal and funding agency procedures, the integration of preregistration and funding offers a promising pathway toward a more transparent and efficient research ecosystem.
Published: 2025-11-03T10:47:36Z
Authors: Lutz Bornmann, Gerald Schweiger

http://arxiv.org/abs/2511.01353v1
AI Literacy in UAE Libraries: Assessing Competencies, Training Needs, and Ethical Considerations for the Digital Age
Updated: 2025-11-03T09:00:15Z
The study explores the current state of artificial intelligence (AI) literacy levels among library professionals, employing a quantitative approach consisting of 92 surveys of LIS professionals in the United Arab Emirates (UAE). Findings of the study revealed the presence of strong cognitive competencies, while there were gaps observed in behavioral and normative competencies, especially related to AI biases, AI-powered learning, and ethical considerations. There was a disconnect observed between the perceived importance of AI skills and the effectiveness of the current training programs.
Published: 2025-11-03T09:00:15Z
Comment: This is the accepted manuscript version. The final published version will appear in College & Research Libraries, November 2026
Authors: Zafar Imam Khan

http://arxiv.org/abs/2511.01942v1
Towards Defect Phase Diagrams: From Research Data Management to Automated Workflows
Updated: 2025-11-03T07:39:32Z
Defect phase diagrams provide a unified description of crystal defect states for materials design and are central to the scientific objectives of the Collaborative Research Centre (CRC) 1394. Their construction requires the systematic integration of heterogeneous experimental and simulation data across research groups and locations.
In this setting, research data management (RDM) is a key enabler of new scientific insight by linking distributed research activities and making complex data reproducible and reusable.
To address the challenge of heterogeneous data sources and formats, a comprehensive RDM infrastructure has been established that links experiment, data, and analysis in a seamless workflow. The system combines: (1) a joint electronic laboratory notebook and laboratory information management system, (2) easy-to-use large-object data storage, (3) automatic metadata extraction from heterogeneous and proprietary file formats, (4) interactive provenance graphs for data exploration and reuse, and (5) automated reporting and analysis workflows. The two key technological elements are the openBIS electronic laboratory notebook and laboratory information management system, and a newly developed companion application that extends openBIS with large-scale data handling, automated metadata capture, and federated access to distributed research data.
This integrated approach reduces friction in data capture and curation, enabling traceable and reusable datasets that accelerate the construction of defect phase diagrams across institutions.
Published: 2025-11-03T07:39:32Z
Authors: Khalil Rejiba, Sang-Hyeok Lee, Christina Gasper, Martina Freund, Sandra Korte-Kerzel, Ulrich Kerzel
DOI: 10.1002/adem.202502882

http://arxiv.org/abs/2511.01113v1
S2Doc -- Spatial-Semantic Document Format
Updated: 2025-11-02T23:06:03Z
Documents are a common way to store and share information, with tables being an important part of many documents. However, there is no real common understanding of how to model documents and tables in particular. Because of this lack of standardization, most scientific approaches have their own way of modeling documents and tables, leading to a variety of different data structures and formats that are not directly compatible. Furthermore, most data models focus on either the spatial or the semantic structure of a document, neglecting the other aspect. To address this, we developed S2Doc, a flexible data structure for modeling documents and tables that combines both spatial and semantic information in a single format. It is designed to be easily extendable to new tasks and supports most modeling approaches for documents and tables, including multi-page documents. To the best of our knowledge, it is the first approach of its kind to combine all these aspects in a single format.
Published: 2025-11-02T23:06:03Z
Comment: 8 pages, 2 figures, submitted to LREC2026
Authors: Sebastian Kempf, Frank Puppe

http://arxiv.org/abs/2407.17032v4
Gymnasium: A Standard Interface for Reinforcement Learning Environments
Updated: 2025-11-02T13:42:19Z
Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations.
This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at https://github.com/Farama-Foundation/Gymnasium
Published: 2024-07-24T06:35:05Z
Comment: Accepted at NeurIPS Datasets and Benchmarks 2025
Authors: Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Hannah Tan, Omar G. Younis

http://arxiv.org/abs/2510.27259v1
Research Output of Webology Journal (2013-2017): A Scientometric Analysis
Updated: 2025-10-31T07:55:16Z
Webology is an international peer-reviewed journal in English devoted to the field of the World Wide Web and serves as a forum for discussion and experimentation. It serves as a forum for new research in information dissemination and communication processes in general, and in the context of the World Wide Web in particular. This paper presents a scientometric analysis of the Webology Journal.
The paper analyses the pattern of growth of the research output published in the journal, the pattern of authorship, author productivity, and the subjects covered by the papers over the period 2013-2017. It is found that 62 papers were published during the period of study (2013-2017). The largest number of articles were collaborative in nature. The subject concentration of the journal was Social Networking/Web 2.0/Library 2.0 and Scientometrics or Bibliometrics. Iranian researchers contributed the largest share of articles (37.10%). The study applied standard formulae and statistical tools to bring out the factual result.
Published: 2025-10-31T07:55:16Z
Comment: 13 pages, 3 figures, Research Paper
Journal reference: International Journal of Movement Education and Social Science; Volume 7 Issue 3; 2018
Authors: Muneer Ahmad, M. Sadik Batcha, Basharat Ahmad Wani, Mohammad Idrees Khan, S. Roselin Jahina

http://arxiv.org/abs/2308.12883v2
Computational Dating for the Nuzi Cuneiform Archive: The Least Squares Constrained by Family Trees and Synchronisms
Updated: 2025-10-31T05:33:23Z
We introduce a computational method of dating for an archive in ancient Mesopotamia. We use the name index Nuzi Personal Names (NPN) published in 1943. We made an electronic version of NPN and added the kinships of the two powerful families to NPN to reflect the Nuzi studies after 1943. Nuzi is a town from the 15th-14th century B.C.E. for a period of some five generations in Arrapha. The cuneiform tablets listed in NPN are for contracts on land transactions, marriage, loans, slavery, etc. In NPN, the kinships and cuneiform tablets (contracts, documents, texts) involved are listed for each person. We reconstruct family trees from the added NPN to formulate the least squares problem with the constraints: a person's father is at least 22.5 years older than the person, contractors were living at the time of the contract, etc. Our results agree with the Assyriological results of M. P. Maidman on the seniority among siblings of a powerful family.
Our method could be applied to the other clay tablet archives once we have the name index in the format of NPN.
Published: 2023-08-23T07:59:25Z
Authors: Sumie Ueda, Takashi Tsuchiya, Yoshiaki Itoh

http://arxiv.org/abs/2510.25718v1
Retrieval-Augmented Search for Large-Scale Map Collections with ColPali
Updated: 2025-10-29T17:27:21Z
Multimodal approaches have shown great promise for searching and navigating digital collections held by libraries, archives, and museums. In this paper, we introduce map-RAS: a retrieval-augmented search system for historic maps. In addition to introducing our framework, we detail our publicly-hosted demo for searching 101,233 map images held by the Library of Congress. With our system, users can multimodally query the map collection via ColPali, summarize search results using Llama 3.2, and upload their own collections to perform inter-collection search. We articulate potential use cases for archivists, curators, and end-users, as well as future work with our system in both machine learning and the digital humanities. Our demo can be viewed at: http://www.mapras.com.
Published: 2025-10-29T17:27:21Z
Comment: 5 pages, 5 figures
Authors: Jamie Mahowald, Benjamin Charles Germain Lee

http://arxiv.org/abs/2510.25283v1
Measuring the Research Output and Performance of the University of Ibadan from 2014 to 2023: A Scientometric Analysis
Updated: 2025-10-29T08:39:36Z
This study employs scientometric methods to assess the research output and performance of the University of Ibadan from 2014 to 2023. By analyzing publication trends, citation patterns, and collaboration networks, the research aims to comprehensively evaluate the university's research productivity, impact, and disciplinary focus. This article's endeavors are characterized by innovation, interdisciplinary collaboration, and commitment to excellence, making the University of Ibadan a significant hub for cutting-edge research in Nigeria and beyond.
The goal of the current study is to ascertain the influence of the university's research output and publication patterns between 2014 and 2023. The study focuses on the departments at the University of Ibadan that contribute the most, the best journals for publishing, the nations that collaborate, the impact of citations both locally and globally, well-known authors and their total production, and the research output broken down by year. According to the university's ten-year publication data, 7159 papers with an h-index of 75 were published between 2014 and 2023, garnering 218572 citations. Furthermore, the VOSviewer software mapping approach is used to illustrate the stenographical mapping of data through graphs. The findings of this study will contribute to understanding the university's research strengths, weaknesses, and potential areas for improvement. Additionally, the results will inform evidence-based decision-making for enhancing research strategies and policies at the University of Ibadan.
Published: 2025-10-29T08:39:36Z
Comment: 16 pages, 5 figures, Research Paper
Journal reference: Nigerian Libraries; Volume 59, Issue 1; 2025
Authors: Muneer Ahmad, Undie Felicia Nkatv
DOI: 10.61955/HFYDJH

http://arxiv.org/abs/2510.26824v1
LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature
Updated: 2025-10-28T17:58:18Z
The development of synthesis procedures remains a fundamental challenge in materials discovery, with procedural knowledge scattered across decades of scientific literature in unstructured formats that are challenging for systematic analysis. In this paper, we propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data from materials science publications, covering text and figures.
We curated 81k open-access papers, yielding LeMat-Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes, structured according to an ontology specific to materials science. The extraction quality is rigorously evaluated on a subset of 2.5k synthesis procedures through a combination of expert annotations and a scalable LLM-as-a-judge framework. Beyond the dataset, we release a modular, open-source software library designed to support community-driven extension to new corpora and synthesis domains. Altogether, this work provides an extensible infrastructure to transform unstructured literature into machine-readable information. This lays the groundwork for predictive modeling of synthesis procedures as well as modeling synthesis--structure--property relationships.
Published: 2025-10-28T17:58:18Z
Comment: 29 pages, 13 figures, 6 tables
Authors: Magdalena Lederbauer, Siddharth Betala, Xiyao Li, Ayush Jain, Amine Sehaba, Georgia Channing, Grégoire Germain, Anamaria Leonescu, Faris Flaifil, Alfonso Amayuelas, Alexandre Nozadze, Stefan P. Schmid, Mohd Zaki, Sudheesh Kumar Ethirajan, Elton Pan, Mathilde Franckel, Alexandre Duval, N. M. Anoop Krishnan, Samuel P. Gleason

http://arxiv.org/abs/2506.03187v2
Comparing Retrieval Strategies to Capture Interdisciplinary Scientific Research: A Bibliometric Evaluation of the Integration of Neuroscience and Computer Science
Updated: 2025-10-28T17:05:07Z
Interdisciplinary scientific research is increasingly important in knowledge production, funding policies, and academic discussions on scholarly communication. While many studies focus on interdisciplinary corpora defined a priori -- usually through keyword-based searches within assumed interdisciplinary domains -- few explore interdisciplinarity as an emergent intersection between two distinct fields. Thus, methodological proposals for building databases at the intersection of two fields of knowledge are scarce.
The goal of this article is to develop and compare different strategies for defining an interdisciplinary corpus between two bodies of knowledge. As a case study, we focus on the intersection between neuroscience and computer science. To this end, we develop and compare four retrieval strategies, two of them based on keywords and two based on citation and reference patterns. Our results show that the reference-based strategy provides better retrieval, pseudorecall, and F1. While we focus on comparing strategies for the study of the intersection between the fields of neuroscience and computer science, this methodological reflection is applicable to a wide range of interdisciplinary domains.
Published: 2025-05-30T19:29:18Z
Authors: Malena Mendez Isla, Agustin Mauro, Diego Kozlowski

http://arxiv.org/abs/2510.24122v1
Comparing Disciplinary Classifications in SSH: Organizational, Channel-Based, and Text-Based Perspectives
Updated: 2025-10-28T06:49:29Z
This study investigates how different approaches to disciplinary classification represent the Social Sciences and Humanities (SSH) in the Flemish VABB-SHW database. We compare organizational classification (based on author affiliation), channel-based cognitive classification (based on publication venues), and text-based publication-level classification (using channel titles, publication titles, and abstracts, depending on availability). The analysis shows that text-based classification generally aligns more closely with channel-based categories, confirming that the channel choice provides relevant information about publication content. At the same time, it is closer to organizational classification than channel-based categories are, suggesting that textual features capture author affiliations more directly than publishing channels do. Comparison across the three systems highlights cases of convergence and divergence, offering insights into how disciplines such as "Sociology" and "History" extend across fields, while "Law" remains more contained.
Publication-level classification also clarifies the disciplinary profiles of multidisciplinary journals in the database, which in VABB-SHW show distinctive profiles with stronger emphases on SSH and health sciences. At the journal level, fewer than half of outlets with more than 50 publications have their channel-level classification fully or partially supported by more than 90% of publications. These results demonstrate the added value of text-based methods for validating classifications and for analysing disciplinary dynamics.
Published: 2025-10-28T06:49:29Z
Comment: Submitted to Scientometrics
Authors: Cristina Arhiliuc, Raf Guns, Tim C. E. Engels

http://arxiv.org/abs/2510.23146v1
Fake scientific journals are here to stay
Updated: 2025-10-27T09:22:54Z
Scientific publishing is facing an alarming proliferation of fraudulent practices that threaten the integrity of research communication. The production and dissemination of fake research have become a profitable business, undermining trust in scientific journals and distorting the evaluation processes that depend on them. This brief piece examines the problem of fake journals through a three-level typology. The first level concerns predatory journals, which prioritise financial gain over scholarly quality by charging authors publication fees while providing superficial or fabricated peer review. The second level analyses hijacked journals, in which counterfeit websites impersonate legitimate titles to deceive authors into submitting and paying for publication. The third level addresses hacked journals, where legitimate platforms are compromised through cyberattacks or internal manipulation, enabling the distortion of review and publication processes. Together, these forms of misconduct expose deep vulnerabilities in the scientific communication ecosystem, exacerbated by the pressure to publish and the marketisation of research outputs. The manuscript concludes that combating these practices requires structural reforms in scientific evaluation and governance.
Only by reducing the incentives that sustain the business of fraudulent publishing can the scholarly community restore credibility and ensure that scientific communication fulfils the essential purpose of reliable advancement of knowledge.
Published: 2025-10-27T09:22:54Z
Comment: 7 pages, 1 figure. Expanded version of blog post published in Spanish
Authors: Enrique Orduña-Malea

http://arxiv.org/abs/2508.13234v2
The Role of AI in Facilitating Interdisciplinary Collaboration: Evidence from AlphaFold
Updated: 2025-10-27T07:32:43Z
The acceleration of artificial intelligence (AI) in science is recognized and many scholars have begun to explore its role in interdisciplinary collaboration. However, the mechanisms and extent of this impact are still unclear. This study, using AlphaFold's impact on structural biologists, examines how AI technologies influence interdisciplinary collaborative patterns. By analyzing 1,247 AlphaFold-related papers and 7,700 authors from Scopus, we employ bibliometric analysis and causal inference to compare interdisciplinary collaboration between AlphaFold adopters and non-adopters. Contrary to the widespread belief that AI facilitates interdisciplinary collaboration, our findings show that AlphaFold increased structural biology-computer science collaborations by just 0.48%, with no measurable effect on other disciplines. Specifically, AI creates interdisciplinary collaboration demands with specific disciplines due to its technical characteristics, but this demand is weakened by technological democratization and other factors. These findings demonstrate that artificial intelligence (AI) alone has limited efficacy in bridging disciplinary divides or fostering meaningful interdisciplinary collaboration.
Published: 2025-08-18T00:31:03Z
Comment: 29 pages, 2 figures
Authors: Naixuan Zhao, Chunli Wei, Xinyan Zhang, Jiang Li