https://arxiv.org/api/7L1kw86T7+3Hh+dihxHxGvH2O3c
2026-06-13T20:43:32Z
606548015

http://arxiv.org/abs/2511.14362v1
SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature
Updated: 2025-11-18T11:09:19Z
The accelerating growth of scientific publications has intensified the need for scalable, trustworthy systems to synthesize knowledge across diverse literature. While recent retrieval-augmented generation (RAG) methods have improved access to scientific information, they often overlook citation graph structure, adapt poorly to complex queries, and yield fragmented, hard-to-verify syntheses. We introduce SciRAG, an open-source framework for scientific literature exploration that addresses these gaps through three key innovations: (1) adaptive retrieval that flexibly alternates between sequential and parallel evidence gathering; (2) citation-aware symbolic reasoning that leverages citation graphs to organize and filter supporting documents; and (3) outline-guided synthesis that plans, critiques, and refines answers to ensure coherence and transparent attribution. Extensive experiments across multiple benchmarks such as QASA and ScholarQA demonstrate that SciRAG outperforms prior systems in factual accuracy and synthesis quality, establishing a new foundation for reliable, large-scale scientific knowledge aggregation.
Published: 2025-11-18T11:09:19Z
Authors: Hang Ding, Yilun Zhao, Tiansheng Hu, Manasi Patwardhan, Arman Cohan

http://arxiv.org/abs/2305.11444v2
NAIST Academic Travelogue Dataset
Updated: 2025-11-18T01:18:57Z
We have constructed the NAIST Academic Travelogue Dataset (ATD) and released it free of charge for academic research. This dataset is a Japanese text dataset with a total of over 31 million words, comprising 4,672 Japanese domestic travelogues and 9,607 overseas travelogues. Before providing our dataset, there was a scarcity of widely available travelogue data for research purposes, and each researcher had to prepare their own data.
This hindered the replication of existing studies and fair comparative analysis of experimental results. Our dataset enables any researcher to conduct investigations on the same data and to ensure transparency and reproducibility in research. In this paper, we describe the academic significance, characteristics, and prospects of our dataset.
Published: 2023-05-19T05:53:49Z
Note: Updated version with revised manuscript
Authors: Hiroki Ouchi, Hiroyuki Shindo, Shoko Wakamiya, Yuki Matsuda, Naoya Inoue, Shohei Higashiyama, Satoshi Nakamura, Taro Watanabe

http://arxiv.org/abs/2511.04820v2
The Drain of Scientific Publishing
Updated: 2025-11-17T20:05:17Z
The domination of scientific publishing in the Global North by major commercial publishers is harmful to science. We need the most powerful members of the research community, funders, governments and Universities, to lead the drive to re-communalise publishing to serve science not the market.
Published: 2025-11-06T21:20:22Z
Note: 1 Figure, 1 Table, 1 Supplementary Table
Authors: Fernanda Beigel, Dan Brockington, Paolo Crosetto, Gemma Derrick, Aileen Fyfe, Pablo Gomez Barreiro, Mark A. Hanson, Stefanie Haustein, Vincent Larivière, Christine Noe, Stephen Pinfield, James Wilsdon

http://arxiv.org/abs/2511.11447v2
GRIN Transfer: A production-ready tool for libraries to retrieve digital copies from Google Books
Updated: 2025-11-17T15:20:14Z
Publicly launched in 2004, the Google Books project has scanned tens of millions of items in partnership with libraries around the world. As part of this project, Google created the Google Return Interface (GRIN). Through this platform, libraries can access their scanned collections, the associated metadata, and the ongoing OCR and metadata improvements that become available as Google reprocesses these collections using new technologies. When downloading the Harvard Library Google Books collection from GRIN to develop the Institutional Books dataset, we encountered several challenges related to rate-limiting and atomized metadata within the GRIN platform.
To overcome these challenges and help other libraries make more robust use of their Google Books collections, this technical report introduces the initial release of GRIN Transfer. This open-source and production-ready Python pipeline allows partner libraries to efficiently retrieve their Google Books collections from GRIN. This report also introduces an updated version of our Institutional Books 1.0 pipeline, initially used to analyze, augment, and assemble the Institutional Books 1.0 dataset. We have revised this pipeline for compatibility with the output format of GRIN Transfer. A library could pair these two tools to create an end-to-end processing pipeline for their Google Books collection to retrieve, structure, and enhance data available from GRIN. This report gives an overview of how GRIN Transfer was designed to optimize for reliability and usability in different environments, as well as guidance on configuration for various use cases.
Published: 2025-11-14T16:16:04Z
Authors: Liza Daly, Matteo Cargnelutti, Catherine Brobston, John Hess, Greg Leppert, Amanda Watson, Jonathan Zittrain

http://arxiv.org/abs/2511.13378v1
Moving Pictures of Thought: Extracting Visual Knowledge in Charles S. Peirce's Manuscripts with Vision-Language Models
Updated: 2025-11-17T13:52:23Z
Diagrams are crucial yet underexplored tools in many disciplines, demonstrating the close connection between visual representation and scholarly reasoning. However, their iconic form poses obstacles to visual studies, intermedial analysis, and text-based digital workflows. In particular, Charles S. Peirce consistently advocated the use of diagrams as essential for reasoning and explanation. His manuscripts, often combining textual content with complex visual artifacts, provide a challenging case for studying documents involving heterogeneous materials. In this preliminary study, we investigate whether Visual Language Models (VLMs) can effectively help us identify and interpret such hybrid pages in context.
First, we propose a workflow that (i) segments manuscript page layouts, (ii) reconnects each segment to IIIF-compliant annotations, and (iii) submits fragments containing diagrams to a VLM. In addition, by adopting Peirce's semiotic framework, we designed prompts to extract key knowledge about diagrams and produce concise captions. Finally, we integrated these captions into knowledge graphs, enabling structured representations of diagrammatic content within composite sources.
Published: 2025-11-17T13:52:23Z
Authors: Carlo Teo Pedretti, Davide Picca, Dario Rodighiero

http://arxiv.org/abs/2511.13801v1
Rdgai: Classifying transcriptional changes using Large Language Models with a test case from an Arabic Gospel tradition
Updated: 2025-11-17T10:03:12Z
Application of phylogenetic methods to textual traditions has traditionally treated all changes as equivalent even though it is widely recognized that certain types of variants were more likely to be introduced than others. While it is possible to give weights to certain changes using a maximum parsimony evaluation criterion, it is difficult to state a priori what these weights should be. Probabilistic methods, such as Bayesian phylogenetics, allow users to create categories of changes, and the transition rates for each category can be estimated as part of the analysis. This classification of types of changes in readings also allows for inspecting the probability of these categories across each branch in the resulting trees. However, classification of readings is time-consuming, as it requires categorizing each reading against every other reading at each variation unit, presenting a significant barrier to entry for this kind of analysis. This paper presents Rdgai, a software package that automates this classification task using multi-lingual large language models (LLMs). The tool allows users to easily manually classify changes in readings and then it uses these annotations in the prompt for an LLM to automatically classify the remaining reading transitions.
These classifications are stored in TEI XML and are ready for downstream phylogenetic analysis. This paper demonstrates the application with data from an Arabic translation of the Gospels.
Published: 2025-11-17T10:03:12Z
Note: 8 figures
Authors: Robert Turnbull

http://arxiv.org/abs/2506.15959v4
Can Recombination Displace Dominant Scientific Ideas
Updated: 2025-11-16T07:22:59Z
The combination of diverse, pre-existing knowledge is a common explanation for scientific breakthroughs. However, a paradox exists: while scientific output and the potential for such recombination have grown exponentially, the rate of breakthrough discoveries has not. To explore this paradox, our study examines 41 million scientific articles from 1965 to 2024. We measure two key properties for each paper: atypicality, which quantifies the combination of knowledge from conceptually distant areas, and disruption. We demonstrate that these metrics capture distinct processes. Atypicality is characteristic of work that extends established concepts into new topical areas (a form of cross-topic recombination). Disruption, in contrast, signifies the replacement of a dominant idea within its own topic.
Published: 2025-06-19T02:00:46Z
Note: 4 figures
Authors: Linzhuo Li, Yiling Lin, Lingfei Wu

http://arxiv.org/abs/2511.12277v1
DataOps-driven CI/CD for analytics repositories
Updated: 2025-11-15T16:09:47Z
The proliferation of SQL for data processing has often occurred without the rigor of traditional software development, leading to siloed efforts, logic replication, and increased risk. This ad-hoc approach hampers data governance and makes validation nearly impossible. To address these challenges, organizations are adopting DataOps, a methodology that combines Agile, Lean, and DevOps principles and treats analytics pipelines as production systems. However, a standardized framework for implementing DataOps is lacking. This perspective proposes a qualitative design for a DataOps-aligned validation framework.
It introduces a DataOps Controls Scorecard, derived from a multivocal literature review, which distills key concepts into twelve testable controls. These controls are then mapped to a modular, extensible CI/CD pipeline framework designed to govern a single source of truth (SOT) SQL repository. The framework consists of five stages: Lint, Optimize, Parse, Validate, and Observe, each containing specific, automated checks. A Requirements Traceability Matrix (RTM) demonstrates how each high-level control is enforced by concrete pipeline checks, ensuring qualitative completeness. This approach provides a structured mechanism for enhancing data quality, governance, and collaboration, allowing teams to scale analytics development with transparency and control.
Published: 2025-11-15T16:09:47Z
Authors: Dmytro Valiaiev

http://arxiv.org/abs/2411.15218v2
Academ-AI: documenting the undisclosed use of generative artificial intelligence in academic publishing
Updated: 2025-11-15T15:24:36Z
Since generative artificial intelligence (AI) tools such as OpenAI's ChatGPT became widely available, researchers have used them in the writing process. The consensus of the academic publishing community is that such usage must be declared in the published article. Academ-AI documents examples of suspected undeclared AI usage in the academic literature, discernible primarily due to the appearance in research papers of idiosyncratic verbiage characteristic of large language model (LLM)-based chatbots. This analysis of the first 768 examples collected reveals that the problem is widespread, penetrating the journals, conference proceedings, and textbooks of highly respected publishers. Undeclared AI seems to appear in journals with higher citation metrics and higher article processing charges (APCs), precisely those outlets that should theoretically have the resources and expertise to avoid such oversights.
An extremely small minority of cases are corrected post publication, and the corrections are often insufficient to rectify the problem. The 768 examples analyzed here likely represent a small fraction of the undeclared AI present in the academic literature, much of which may be undetectable. Publishers must enforce their policies against undeclared AI usage in cases that are detectable; this is the best defense currently available to the academic publishing community against the proliferation of undisclosed AI. This is an updated version of a previous preprint.
Published: 2024-11-20T21:29:36Z
Note: 24 pages, 8 figures
Authors: Alex Glynn

http://arxiv.org/abs/2511.13773v1
PRITES: An integrative framework for investigating and assessing web-scraped HTTP-response datasets for research applications
Updated: 2025-11-15T08:17:54Z
The ability to programmatically retrieve vast quantities of data from online sources has given rise to increasing usage of web-scraped datasets for various purposes across government, industry and academia. Contemporaneously, there has also been growing discussion about the statistical qualities and limitations of collecting from online data sources and analysing web-scraped datasets. However, literature on web-scraping is distributed across computer science, statistical methodology and application domains, with distinct and occasionally conflicting definitions of web-scraping and conceptualisations of web-scraped data quality. This work synthesises technical and statistical concepts, best practices and insights across these relevant disciplines to inform documentation during web-scraping processes, and quality assessment of the resultant web-scraped datasets.
We propose an integrated framework to cover multiple processes during the creation of web-scraped datasets including 'Plan', 'Retrieve', 'Investigate', 'Transform', 'Evaluate' and 'Summarise' (PRITES). The framework groups related quality factors which should be monitored during the collection of new web-scraped data, and/or investigated when assessing potential applications of existing web-scraped datasets. We connect each stage to existing discussions of technical and statistical challenges in collecting and analysing web-scraped data. We then apply the framework to describe related work by the co-authors to adapt web-scraped retail prices for alcoholic beverages collected by an industry data partner into analysis-ready datasets for public health policy research. The case study illustrates how the framework supports accurate and comprehensive scientific reporting of studies using web-scraped datasets.
Published: 2025-11-15T08:17:54Z
Authors: Cynthia A. Huang, Tina Lam

http://arxiv.org/abs/2511.01211v3
Novelty and Impact of Economics Papers
Updated: 2025-11-15T02:55:52Z
We propose a framework that recasts scientific novelty not as a single attribute of a paper, but as a reflection of its position within the evolving intellectual landscape. We decompose this position into two orthogonal dimensions: spatial novelty, which measures a paper's intellectual distinctiveness from its neighbors, and temporal novelty, which captures its engagement with a dynamic research frontier. To operationalize these concepts, we leverage Large Language Models to develop semantic isolation metrics that quantify a paper's location relative to the full-text literature. Applying this framework to a large corpus of economics articles, we uncover a fundamental trade-off: these two dimensions predict systematically different outcomes. Temporal novelty primarily predicts citation counts, whereas spatial novelty predicts disruptive impact.
This distinction allows us to construct a typology of semantic neighborhoods, identifying four archetypes associated with distinct and predictable impact profiles. Our findings demonstrate that novelty can be understood as a multidimensional construct whose different forms, reflecting a paper's strategic location, have measurable and fundamentally distinct consequences for scientific progress.
Published: 2025-11-03T04:12:03Z
Authors: Chaofeng Wu

http://arxiv.org/abs/2511.13754v1
The current state of open access
Updated: 2025-11-14T03:43:07Z
This article presents reflections from the perspective of a university librarian involved in the establishment and management of institutional repositories in Japan. It examines the historical evolution of scholarly communication, from the oral exchanges of ancient Greek philosophers, through the advent of printing and the rise of academic journals, to the contemporary digital era. The origins of the open access movement are emphasized as rooted in authors' desire to disseminate knowledge globally, rather than merely opening access for readers. The article critically discusses current practices in Japan, including institutional repositories, open access journals, and "read-and-publish" agreements, highlighting that many digital innovations still imitate the conventions of print-based scholarly communication. Furthermore, it explores the challenges and opportunities posed by electronic information dissemination, including the limitations of the Version of Record and the potential of diamond open access models. The article argues that genuine progress in scholarly communication requires rethinking publication practices, embracing the modifiability of digital content, and developing new communication models fully suited to the internet era.
Ultimately, the goal is to realize a global, real-time, trustworthy scholarly communication system that transcends the historical constraints of print media.
Published: 2025-11-14T03:43:07Z
Note: The Journal of Information Science and Technology Association, Vol. 75, No. 9 (2025) (Special Issue: The Future of Open Access: Preparing for Japan's Immediate OA Mandate)
Authors: Shigeki Sugita
DOI: 10.18919/jkg.75.9_417

http://arxiv.org/abs/2511.07468v2
crate2bib: Citing Rust crates made easy
Updated: 2025-11-13T19:39:19Z
crate2bib is a collection of tools designed to convert Rust crates hosted on crates.io into bibliography entries. It queries the server, extracts metadata from the given crate and also searches for possible CITATION.cff files within the repository that hosts the code of the crate of interest. It then formats this information, such as name, version and authors, and generates entries for all available candidates. With this approach, crates can be cited easily and existing citations for published crates can be found. The tool can be used as a webapp, Python package, command-line utility or Rust crate.
Published: 2025-11-07T21:00:05Z
Note: See the GitHub repository: https://github.com/jonaspleyer/crate2bib/
Authors: Jonas Pleyer

http://arxiv.org/abs/2511.10722v1
Practical Author Name Disambiguation under Metadata Constraints: A Contrastive Learning Approach for Astronomy Literature
Updated: 2025-11-13T19:00:00Z
The ability to distinctly and properly collate an individual researcher's publications is crucial for ensuring appropriate recognition, guiding the allocation of research funding and informing hiring decisions. However, accurately grouping and linking a researcher's entire body of work with their individual identity is challenging because of widespread name ambiguity across the growing literature. Algorithmic author name disambiguation provides a scalable approach to disambiguating author identities, yet existing methods have limitations.
Many modern author name disambiguation methods rely on comprehensive metadata features such as venue or affiliation. Despite advancements in digitally indexing publications, metadata is often unavailable or inconsistent in large digital libraries (e.g. NASA/ADS). We introduce the Neural Author Name Disambiguator, a method that disambiguates author identities in large digital libraries despite limited metadata availability. We formulate the disambiguation task as a similarity learning problem by employing a Siamese neural network to disambiguate author names across publications, relying solely on widely available publication metadata: author names, titles and abstracts. We construct the Large-Scale Physics ORCiD Linked dataset to evaluate the Neural Author Name Disambiguator by cross-matching NASA/ADS publications with ORCiD. By leveraging foundation models to embed metadata into features, our model achieves up to 94% accuracy in pairwise disambiguation and over 95% F1 in clustering publications into their researcher identities. We release the testing dataset as a benchmark for physics and astronomy, providing realistic evaluation conditions for future disambiguation methods. The Neural Author Name Disambiguator algorithm demonstrates effective disambiguation with minimal metadata, offering a scalable solution for name ambiguity in large digital libraries.
Published: 2025-11-13T19:00:00Z
Authors: Vicente Amado Olivo, Wolfgang Kerzendorf, Bangjing Lu, Joshua V. Shields, Andreas Flörs, Nutan Chen

http://arxiv.org/abs/2511.11752v1
Towards autonomous quantum physics research using LLM agents with access to intelligent tools
Updated: 2025-11-13T18:18:58Z
Artificial intelligence (AI) is used in numerous fields of science, yet the initial research questions and targets are still almost always provided by human researchers. AI-generated creative ideas in science are rare and often vague, so that it remains a human task to execute them.
Automating idea generation and implementation in one coherent system would significantly shift the role of humans in the scientific process. Here we present AI-Mandel, an LLM agent that can generate and implement ideas in quantum physics. AI-Mandel formulates ideas from the literature and uses a domain-specific AI tool to turn them into concrete experiment designs that can readily be implemented in laboratories. The ideas generated by AI-Mandel are often scientifically interesting; for two of them we have already written independent scientific follow-up papers. The ideas include new variations of quantum teleportation, primitives of quantum networks in indefinite causal orders, and new concepts of geometric phases based on closed loops of quantum information transfer. AI-Mandel is a prototypical demonstration of an AI physicist that can generate and implement concrete, actionable ideas. Building such a system is not only useful to accelerate science, but it also reveals concrete open challenges on the path to human-level artificial scientists.
Published: 2025-11-13T18:18:58Z
Note: 24 pages, 5 figures
Authors: Sören Arlt, Xuemei Gu, Mario Krenn