http://arxiv.org/abs/2605.18410v1 From Node2Vec to GPT-based GraphRAG: scientific impact prediction across graph and language models 2026-05-18T13:48:58Z

Identifying which newly published scientific papers are likely to become highly cited is important for prioritizing research attention, supporting editorial decisions, and guiding the allocation of scientific resources, particularly under cold-start conditions where little direct evidence is available at publication time. In this work, we formulate impact prediction as a cohort-normalized top-P% classification task and compare graph-based and LLM-based approaches under a unified framework. We construct citation and textual-similarity graphs under temporal constraints and generate Node2Vec representations, either alone or combined with OpenAI text embeddings. The best supervised configuration combines directed citation graphs with textual embeddings, reaching approximately 0.84-0.85 AUC. We also evaluate a GPT-based GraphRAG setup, using GPT 5.5 and 5.4 Nano, in which graph neighborhoods are used as contextual evidence for prediction. Although the LLM-based approach achieves high performance, retrieved context does not consistently improve results; target-only prompts often perform as well as or better than GraphRAG prompts, reaching the 0.87 mark. These findings indicate that structural and textual signals are complementary for supervised prediction, while retrieval augmentation must be carefully evaluated against simpler LLM baselines.
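As a rough illustration of what a cohort-normalized top-P% label could look like, the sketch below labels a paper positive when its citation count falls in the top P% of its own cohort. It assumes cohorts are publication years and that ties at the cutoff count as positive; the abstract specifies neither, and the function name `top_p_labels` is hypothetical.

```python
import math
from collections import defaultdict

def top_p_labels(papers, p=10.0):
    """Label each paper 1 if its citations are in the top p% of its cohort.

    `papers` is a list of (paper_id, cohort, citations) tuples. The cohort is
    assumed to be the publication year; ties at the cutoff are all positive.
    """
    by_cohort = defaultdict(list)
    for paper_id, cohort, citations in papers:
        by_cohort[cohort].append((paper_id, citations))

    labels = {}
    for items in by_cohort.values():
        counts = sorted((c for _, c in items), reverse=True)
        # Number of papers inside the top p% of this cohort (at least one).
        k = max(1, math.ceil(len(counts) * p / 100.0))
        cutoff = counts[k - 1]
        for paper_id, citations in items:
            labels[paper_id] = int(citations >= cutoff)
    return labels

# Toy data: the 2021 cohort is younger and has fewer citations overall.
papers = [
    ("a", 2020, 50), ("b", 2020, 5), ("c", 2020, 1), ("d", 2020, 0),
    ("e", 2021, 3), ("f", 2021, 2), ("g", 2021, 1), ("h", 2021, 0),
]
labels = top_p_labels(papers, p=25.0)
print(labels)
```

In the toy data, paper "e" is positive with 3 citations while paper "b" is negative with 5, because each is compared only against papers from its own year.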

2026-05-18T13:48:58Z Adilson Vital Filipi N. Silva Diego R. Amancio

http://arxiv.org/abs/2507.18406v2 Factual Inconsistencies in Multilingual Wikipedia Tables 2026-05-18T12:50:56Z

Wikipedia serves as a globally accessible knowledge source with content in over 300 languages. Despite covering the same topics, the different versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and of AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia's structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from multilingual Wikipedia articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and the design of reliable AI systems leveraging Wikipedia content.

2025-07-24T13:46:14Z 11 pages, 7 figures, White Paper for RTF Work at ISWS Summer School 2025 Silvia Cappa Lingxiao Kong Pille-Riin Peet Fanfu Wei Yuchen Zhou Jan-Christoph Kalo

http://arxiv.org/abs/2511.08639v3 The Journal of Prompt-Engineered (Moral) Philosophy Or: Why AI-Assisted Ethics Research Requires Process Transparency 2026-05-18T11:23:04Z

Existing AI disclosure mandates in scholarship require that AI assistance be reported but leave transparency philosophically unspecified: they fix the duty without explaining what the duty serves. We argue that ethical inquiry is essentially contested at two independent levels -- about what it is, and about what it demands of the inquirer -- defeating output-only evaluation and welfare-economic dismissal of the transparency question, and, by extension, reproducibility framings imported from the empirical sciences. The transparency duty is grounded instead in agent-integrity: the legibility, before a community of inquiry, of the identity-constituting commitments that the author's mode of philosophising expresses. Because the standards for evaluating such work are not communally settled, the achievable goal for transparency is not evaluation against agreed criteria but tracking -- accumulating the evidentiary record that lets each tradition assess the work on its own terms and makes future normative judgments possible. We develop a documentation-adequacy framework that operationalises Meaningful Human Control through five transparency elements -- declaration, navigation, documentation account, process documentation, and development records -- demonstrated by the paper itself, whose full documentation record is archived at a persistent identifier. The framework is a first iteration subject to revision, not a settled standard.

2025-11-10T08:56:21Z 21 pages Transparency material documenting LLM usage available at: https://github.com/MicheleLoi/JPEP/tree/main/transparency/Canonical_MD Michele Loi

http://arxiv.org/abs/2605.21517v1 The Ephemeral Web and the Case for Proactive Archiving 2026-05-18T04:10:32Z

The web is often treated as a durable record of institutional and social life, yet in practice it is fragile, revisable, and frequently ephemeral. Domains change, redesigns erase earlier material, institutions relocate, maintainers graduate, platforms impose silent limits, and periods of political instability can interrupt digital access entirely. This paper argues that archiving should not remain a niche activity practiced by a few specialists at the margins, but should become a proactive part of website maintenance. I motivate this claim through a case study centered on the Pakistan Embassy International School and College Tehran, whose domain, visual identity, leadership, and physical location all changed within a short period after my graduation. In response, I built and deployed a lightweight automated archival system using Python and GitHub Actions to submit pages and media from the site to the Internet Archive's Wayback Machine. The project shows both that archival preservation can be automated with modest infrastructure and that archival systems are themselves vulnerable to interruption, as illustrated by GitHub's automatic disabling of scheduled workflows after repository inactivity. Drawing on personal experience with internet shutdowns in Iran, open-source sustainability lessons from RPI's RCOS, and the operational history of the archiver, I argue that the ephemerality of the web is not an exception but a structural condition. If digital societies wish to preserve institutional memory and public history without leaving preservation to chance, proactive archiving should become a commonplace part of website maintenance.
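The author's archiver is not reproduced in the abstract. As a minimal sketch of the general idea, the snippet below builds Wayback Machine capture requests with the Python standard library, assuming the public Save Page Now URL form `https://web.archive.org/save/<url>`. The page list, the `User-Agent` string, and all function names are hypothetical.

```python
from urllib.request import Request, urlopen

SAVE_ENDPOINT = "https://web.archive.org/save/"

# Hypothetical page list; a real job would crawl or enumerate the site.
PAGES = [
    "https://example.org/",
    "https://example.org/about",
]

def save_url(page_url):
    """Return the Save Page Now URL that requests a capture of page_url."""
    return SAVE_ENDPOINT + page_url

def submit(page_url, timeout=60):
    """Request one capture and return the HTTP status code."""
    request = Request(save_url(page_url), headers={"User-Agent": "example-archiver/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.status

def archive_all(pages):
    """Submit every page, recording failures instead of stopping at the first one."""
    results = {}
    for page in pages:
        try:
            results[page] = submit(page)
        except OSError as error:  # URLError and HTTPError are OSError subclasses
            results[page] = repr(error)
    return results

# No network call happens here; a scheduled job would call archive_all(PAGES).
for page in PAGES:
    print(save_url(page))
```

Collecting errors per page rather than raising keeps one failed capture from aborting a whole scheduled run, which matters when the job runs unattended.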

2026-05-18T04:10:32Z 9 pages, 1 algorithm Meliksah Yorulmazlar

http://arxiv.org/abs/2605.17023v1 Sum of rank ratios: an alternative to percentiles for research assessment, from groundbreaking to mainstream research 2026-05-16T14:51:37Z

Assessing research that pushes the boundaries of knowledge is challenging because such work is extremely infrequent, accounting for only about 0.01 per cent of all research outputs. Consequently, knowledge about how to evaluate this type of research is far more limited than the well-established methods used to assess more common research outcomes. This study addresses this gap by using a rank-based approach in which each paper is assigned a unique value equal to the ratio between its local and global ranks. The cumulative value of these ratios, starting from the most cited paper, provides the evaluative basis, and the Rn index described here, using 10 rank ratios, appears to be the best option. Although research assessment based on global ranks was originally developed to evaluate the largest contributors to groundbreaking knowledge, namely, the USA and China, which account for most of the most cited papers, the Rn index has broader applications. This study demonstrates that it is also a better option than the number of top 10 per cent or top 1 per cent highly cited papers, which are the most common indicators used to evaluate countries that seldom or never produce cutting-edge research that pushes the boundaries of knowledge. In all cases, the Rn index reflects the highest quality science produced by each country. Furthermore, the Rn index can be easily calculated without specialized training in bibliometrics and is only negligibly affected by ties in citation counts.
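The rank-ratio idea can be sketched as follows, reading the abstract literally: sort a unit's papers by citations, divide each paper's local rank by its rank in the global list, and sum the first n ratios. This is an interpretation, not the paper's reference implementation. Breaking ties by paper id is an assumption, and `rank_ratio_sum` is a hypothetical name.

```python
def rank_ratio_sum(global_citations, local_ids, n=10):
    """Sum the first n local-rank / global-rank ratios for one unit.

    `global_citations` maps paper id -> citation count for the whole
    reference set; `local_ids` is the set of ids belonging to the unit
    being assessed (for example, one country).
    """
    # Global order: most cited first; ties broken by id (an assumption).
    ordered = sorted(global_citations, key=lambda pid: (-global_citations[pid], pid))
    global_rank = {pid: rank for rank, pid in enumerate(ordered, start=1)}

    # The unit's own papers, kept in the same most-cited-first order.
    local_sorted = [pid for pid in ordered if pid in local_ids]

    return sum(
        local_rank / global_rank[pid]
        for local_rank, pid in enumerate(local_sorted[:n], start=1)
    )

# Toy reference set of six papers shared by two units.
world = {"p1": 100, "p2": 90, "p3": 80, "p4": 70, "p5": 60, "p6": 50}
unit_a = rank_ratio_sum(world, {"p1", "p3", "p4"})  # 1/1 + 2/3 + 3/4
unit_b = rank_ratio_sum(world, {"p2", "p5", "p6"})  # 1/2 + 2/5 + 3/6
print(round(unit_a, 3), round(unit_b, 3))
```

Under this reading, a unit that owns all of the n most cited papers scores exactly n, and the score shrinks as its best papers sit further down the global list.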

2026-05-16T14:51:37Z Alonso Rodriguez-Navarro

http://arxiv.org/abs/2605.16562v1 Scaling Accessible Mathematics on arXiv: HTML Conversion and MathML 4 2026-05-15T19:04:45Z

We report on the ongoing development of arXiv's HTML Papers offering, available on every new TeX/LaTeX submission since its initial release in 2023. The main highlights from 2025 and early 2026 are: (i) community-driven improvements to HTML fidelity and service health, with roughly half of 6,000 user reports resolved; (ii) corpus-scale conversion work aimed at 90% error-free HTML (currently 75%); (iii) initial MathML 4 Intent annotations for accessible speech output; (iv) an in-progress Rust port of LaTeXML, reducing compute costs and enabling faster previews on submission. The arXiv HTML Papers project remains experimental, but is gradually maturing as we better understand the needs of arXiv's readers and the technical opportunities presented by new standards and by advances in programming languages and AI.

2026-05-15T19:04:45Z 6 pages, ICMS 2026 Deyan Ginev Brian Caruso Bruce Miller Jeff Sank Jacob Weiskoff

http://arxiv.org/abs/2605.16194v1 paper.json: A Coordination Convention for LLM-Agent-Actionable Papers 2026-05-15T17:10:50Z

LLM agents routinely serve as first (and sometimes only) readers of academic papers, skimming for sub-claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub-claims that cannot be cited at sub-paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than the paper itself. We propose `paper.json`, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does-not-claim list (C2), exact per-figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, hand-written JSON alongside the PDF, is achievable in under an hour for a finished paper without touching the human-readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: `uv run validator.py paper.json --against paper.typ` passes. Repo: https://github.com/arquicanedo/paper-json
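To make the four content conventions concrete, the sketch below builds a companion file and runs a trivial presence check on its IDs. Every key name, ID format, and command in it is invented for illustration and is not the schema from the linked repository; `missing_ids` is likewise hypothetical and is not the paper's `validator.py`.

```python
import json

# Hypothetical companion file; all key names are illustrative assumptions.
paper = {
    "claims": [
        {"id": "claim-1", "text": "Method X reduces error on dataset Y."},
    ],
    "does_not_claim": [
        "Generalization beyond dataset Y.",
    ],
    "figures": [
        {"id": "fig-1", "command": "python make_fig1.py --out fig1.pdf"},
    ],
    "definitions": [
        {"id": "def-1", "term": "error", "text": "Mean absolute error."},
    ],
}

def missing_ids(doc):
    """Return the names of sections containing an entry without an 'id'."""
    problems = []
    for section in ("claims", "figures", "definitions"):
        if any(not entry.get("id") for entry in doc.get(section, [])):
            problems.append(section)
    return problems

# Round-trip through JSON text, as an agent reading the file would.
loaded = json.loads(json.dumps(paper))
print(missing_ids(loaded))
```

The point of stable IDs in a layout like this is that an agent can cite `claim-1` or rerun the command for `fig-1` without parsing prose.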

2026-05-15T17:10:50Z Arquimedes Canedo

http://arxiv.org/abs/2605.16475v1 Generative Artificial Intelligence for Literature Reviews 2026-05-15T15:42:54Z

Generative artificial intelligence (GenAI), based on large language models (LLMs), such as ChatGPT, has taken organizations, academia, and the public by storm. In particular, impressive GenAI capabilities such as summarization of large text corpora, question-answering, data extraction, and translation, carry profound implications for the conduct of literature reviews. This impacts science, organizations and the general public, as all can benefit from GenAI-supported literature reviews. Building on the technical foundations of GenAI and grounded in established methodological discourse, this work outlines approaches for conducting literature reviews using both general-purpose (e.g., ChatGPT, Gemini, Claude) and specialized GenAI tools (e.g., Consensus, Elicit). We provide illustrative examples of prompts and suggest methodologically-sound literature review strategies. Throughout this perspective paper, we adopt a balanced approach considering both the opportunities and the risks of relying on GenAI in the conduct of literature reviews. We conclude by discussing philosophical questions related to the effects of GenAI on long-term scientific progress, and also present fruitful opportunities for research on improving the core of GenAI's technology (its architecture and training data) and suggest open issues in GenAI-based literature review methodology.

2026-05-15T15:42:54Z Journal of Information Technology, 02683962261425675 (2026) Gerit Wagner Julian Prester Reza Mousavi Roman Lukyanenko Guy Pare 10.1177/02683962261425675

http://arxiv.org/abs/2605.15362v1 Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering 2026-05-14T19:42:20Z

Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy. We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]). Three principal findings emerge. (1) The degree distribution follows a power law (alpha = 1.57 +/- 0.008), placing the Ukrainian court network near the EU Court of Justice and below the US Supreme Court, with hub articles cited by millions of decisions. (2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice. (3) Citation features predict top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655); temporal dynamics detect legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 -> 13.49) with emergent wartime legislation nodes. The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.
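The entropy spike mentioned above can be illustrated with a small sketch, assuming H is the Shannon entropy (in bits) of how citations within a time window are distributed over cited articles. The abstract does not state the exact definition or logarithm base, so both are assumptions, as are the function name and the toy article labels.

```python
import math
from collections import Counter

def citation_entropy(cited_targets):
    """Shannon entropy in bits of the citation share per cited target.

    `cited_targets` is an iterable with one entry per citation edge, naming
    the article that was cited. An empty window has entropy 0.0.
    """
    counts = Counter(cited_targets)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        share = count / total
        entropy -= share * math.log2(share)
    return entropy

# All citations go to one article: no uncertainty about the target.
concentrated = citation_entropy(["art_1"] * 8)
# Citations spread evenly over four articles: log2(4) = 2 bits.
spread = citation_entropy(["art_1", "art_2", "art_3", "art_4"] * 2)
print(concentrated, spread)
```

Under this definition, entropy rises when citations spread over more targets, which is one way newly cited legislation could register as a spike.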

2026-05-14T19:42:20Z 15 pages, 7 figures, 2 tables, 21 references Volodymyr Ovcharov

http://arxiv.org/abs/2605.02651v2 ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review 2026-05-14T18:08:35Z

Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates its reconstructability using structural and content-based scores for reproducibility assessments. Experiments on 213 ReScience C articles - the largest cross-domain benchmark of human-validated computational reproducibility studies considered to date - demonstrate ARA's generalizability and consistent workflow reconstruction and assessment across LLMs, model temperatures, and scientific domains. ARA achieves ~61% accuracy on three benchmarks, and the highest accuracy reported on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%), highlighting its potential to complement human review at scale and enabling next-generation peer review. Code and Data available: https://github.com/AndresLaverdeMarin/agentic_reproducibility_assessment.

2026-05-04T14:34:36Z Kevin Riehl Andres L. Marin Nikofors Zacharof Fan Wu Patrick Langer Robert Jakob Anastasios Kouvelas Georgios Fontaras Michail A. Makridis

http://arxiv.org/abs/2605.15079v1 Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets 2026-05-14T17:04:39Z

Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains.

2026-05-14T17:04:39Z 23 pages, 5 figures, 11 tables. Project: https://lcp.mit.edu/croissant-baker/ Code: https://github.com/MIT-LCP/croissant-baker Rafi Al Attrach Rajna Fani Sebastian Lobentanzer Joan Giner-Miguelez Debanshu Das Varuni H. K. Nobin Sarwar Rajat Ghosh Anwai Archit Surbhi Motghare Christina Conrad Parry Luis Oala Lara Grosso Joaquin Vanschoren Steffen Vogler Sujata Goswami Eric S. Rosenthal Marzyeh Ghassemi Matthew McDermott Tom Pollard

http://arxiv.org/abs/2605.14722v1 A Template-Driven Platform for Contextualised Researcher Profiles 2026-05-14T11:45:37Z

Modern researchers engage in diverse activities, assume multiple contribution roles, and produce a variety of outputs beyond traditional publications. This broader view of research contributions is increasingly recognised by responsible research assessment initiatives. However, existing researcher profiling platforms remain largely focused on publications and publication-centric indicators, offering limited support for contextualised and multi-dimensional representations of research careers. This paper presents BIP! Scholar, a platform that supports flexible researcher profiling through a template-driven approach. Researchers can create profiles tailored to different presentation or assessment contexts using track-based, narrative-style, or hybrid templates which support the representation of diverse outputs, contribution roles, and broader research activities. The platform also supports research assessment experts who wish to design and evaluate experimental profile templates.

2026-05-14T11:45:37Z Serafeim Chatzopoulos Paris Koloveas Kleanthis Vichos Dionysis Diamantis Thanasis Vergoulis

http://arxiv.org/abs/2602.15249v2 Artificial Intelligence Specialization in the European Union: Underexplored Role of the Periphery at NUTS-3 Level 2026-05-14T00:30:42Z

This study examines the distribution of Artificial Intelligence (AI) research across European NUTS-3 regions during the period 2015-2024. Using bibliometric data from Clarivate InCites and the Citation Topics classification system, we analyse two hierarchical thematic levels: Electrical Engineering, Electronics & Computer Science (Macro Citation Topic 4) and Artificial Intelligence & Machine Learning (Meso Citation Topic 4.61). Relative Specialization Index (RSI) and Relative Citation Impact (RCI) indicators are calculated for 781 European NUTS-3 regions. While major metropolitan hubs such as Paris, Warszawa, and Madrid dominate in absolute publication volume, the results reveal that the highest levels of relative AI specialization are concentrated in peripheral regions, particularly in Eastern Europe and Spain. Granada and Vilniaus apskritis stand out as regions combining high specialization with strong citation visibility. The analysis further suggests a weak relationship between regional specialization and citation impact, revealing multiple regional profiles, including highly specialized regions with limited citation visibility, highly visible regions with comparatively low specialization, and diversified scientific systems combining moderate specialization with strong citation impact. Fyn emerges as an extreme case of very high citation impact despite relatively low specialization.
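The abstract does not give formulas for the two indicators. The sketch below uses forms that are common in bibliometrics and may differ from the paper's: an activity index (the region's share of output in the field divided by the world's share) rescaled to the range -1 to 1 for RSI, and a ratio of citations per paper for RCI. All numbers are invented, and both function names are hypothetical.

```python
def relative_specialization_index(region_field, region_total, world_field, world_total):
    """RSI = (AI - 1) / (AI + 1), where AI is the activity index.

    AI compares the field's share of the region's output with the field's
    share of world output. RSI > 0 means the region is relatively
    specialized in the field; RSI < 0 means the opposite.
    """
    activity = (region_field / region_total) / (world_field / world_total)
    return (activity - 1.0) / (activity + 1.0)

def relative_citation_impact(region_cites, region_papers, world_cites, world_papers):
    """Citations per paper in the region relative to the world, same field."""
    return (region_cites / region_papers) / (world_cites / world_papers)

# Invented example: 20% of the region's papers are in the field vs 10% worldwide.
rsi = relative_specialization_index(200, 1000, 50_000, 500_000)
# Invented example: 15 citations per paper in the region vs 10 worldwide.
rci = relative_citation_impact(3000, 200, 500_000, 50_000)
print(round(rsi, 3), round(rci, 3))
```

Because RSI depends only on shares, a small region with modest absolute output can score higher than a large metropolitan hub, which is consistent with the pattern the abstract reports.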

2026-02-16T23:01:14Z 15 pages, 3 figures Victor Herrero-Solana Carmen Gálvez

http://arxiv.org/abs/2605.14188v1 QOuLiPo: What a quantum computer sees when it reads a book 2026-05-13T23:10:15Z

What does a book look like to a quantum computer? This paper takes eight classical works of the Renaissance and its late-antique inheritance -- from Augustine to Galileo -- and runs each through a neutral-atom quantum processor. The bridge is graphs: each textual unit becomes an atom, and graph edges are physical blockade constraints for engineered exact unit-disk designs, or a 2D approximation to the semantic graph for natural texts. Three contributions follow. First, we introduce rigidity rho, a metric for how unique a book's structural backbone is -- distinguishing Marguerite de Navarre's Heptameron (rigid, twelve-nouvelle hard core) from Boethius (fully fungible, every chapter substitutable). Second, we invert the pipeline: rather than extracting a graph from existing prose, we pick a target graph the hardware encodes natively, and write a book whose structure matches it. The twenty-nine texts written this way, collected under the name QOuLiPo, extend the OuLiPo tradition to graph-topological constraints and, together with the eight natural texts, form a benchmark distribution against which neutral-atom hardware can be tracked as it scales. Third, we run both natural and engineered texts on Pasqal's FRESNEL processor up to one hundred atoms; engineered texts reach high approximation ratios, the cleanest instances returning the exact backbone. A cloud-accessible quantum machine plus an agentic coding environment now lets a single investigator run this pipeline end-to-end. What is reported is an application layer, not a speedup -- humanistic instances ready to load onto neutral-atom processors as they scale, already complementing classical text analysis. The Digital Humanities community has a stake in building familiarity with this hardware now: the engineered-corpus design choices made today fix the benchmark distribution future hardware will be measured against.

2026-05-13T23:10:15Z Christophe Jurczak

http://arxiv.org/abs/2605.13310v1 SemRepo: A Knowledge Graph for Research Software and Its Scholarly Ecosystem 2026-05-13T10:25:43Z

We present SemRepo, an RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories associated with scientific research. SemRepo captures repository-level metadata, such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in LPWC, and research artifacts, such as datasets and experiments, are linked via MLSea-KG. This integration enables queries that span publications and their scholarly artifacts, which are typically fragmented across separate platforms. SemRepo supports analyses that are difficult to perform with existing resources in isolation, including provenance reconstruction across repositories and publications, as well as the systematic identification of risks to research reproducibility and software sustainability. By unifying research software with its scholarly context in a single graph, SemRepo provides an important infrastructure for large-scale analysis of software within the broader scientific research ecosystem.

2026-05-13T10:25:43Z Abdul Rafay Yuni Susanti David Lamprecht Michael Färber