http://arxiv.org/abs/2605.18410v1
From Node2Vec to GPT-based GraphRAG: scientific impact prediction across graph and language models
2026-05-18T13:48:58Z
Identifying which newly published scientific papers are likely to become highly cited is important for prioritizing research attention, supporting editorial decisions, and guiding the allocation of scientific resources, particularly under cold-start conditions where little direct evidence is available at publication time. In this work, we formulate impact prediction as a cohort-normalized top-P% classification task and compare graph-based and LLM-based approaches under a unified framework. We construct citation and textual-similarity graphs under temporal constraints and generate Node2Vec representations, either alone or combined with OpenAI text embeddings. The best supervised configuration combines directed citation graphs with textual embeddings, reaching approximately 0.84-0.85 AUC. We also evaluate a GPT-based GraphRAG setup, using GPT 5.5 and 5.4 Nano, in which graph neighborhoods are used as contextual evidence for prediction. Although the LLM-based approach achieves high performance, retrieved context does not consistently improve results; target-only prompts often perform as well as or better than GraphRAG prompts, reaching the 0.87 mark. These findings indicate that structural and textual signals are complementary for supervised prediction, while retrieval augmentation must be carefully evaluated against simpler LLM baselines.
2026-05-18T13:48:58Z
Adilson Vital, Filipi N. Silva, Diego R. Amancio

http://arxiv.org/abs/2507.18406v2
Factual Inconsistencies in Multilingual Wikipedia Tables
2026-05-18T12:50:56Z
Wikipedia serves as a globally accessible knowledge source with content in over 300 languages. Despite covering the same topics, the different versions of Wikipedia are written and updated independently.
This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia's structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from multilingual Wikipedia articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content.
2025-07-24T13:46:14Z
11 pages, 7 figures, White Paper for RTF Work at ISWS Summer School 2025
Silvia Cappa, Lingxiao Kong, Pille-Riin Peet, Fanfu Wei, Yuchen Zhou, Jan-Christoph Kalo

http://arxiv.org/abs/2511.08639v3
The Journal of Prompt-Engineered (Moral) Philosophy Or: Why AI-Assisted Ethics Research Requires Process Transparency
2026-05-18T11:23:04Z
Existing AI disclosure mandates in scholarship require that AI assistance be reported but leave transparency philosophically unspecified: they fix the duty without explaining what the duty serves. We argue that ethical inquiry is essentially contested at two independent levels -- about what it is, and about what it demands of the inquirer -- defeating output-only evaluation and welfare-economic dismissal of the transparency question, and, by extension, reproducibility framings imported from the empirical sciences. The transparency duty is grounded instead in agent-integrity: the legibility, before a community of inquiry, of the identity-constituting commitments that the author's mode of philosophising expresses.
Because the standards for evaluating such work are not communally settled, the achievable goal for transparency is not evaluation against agreed criteria but tracking -- accumulating the evidentiary record that lets each tradition assess the work on its own terms and makes future normative judgments possible. We develop a documentation-adequacy framework that operationalises Meaningful Human Control through five transparency elements -- declaration, navigation, documentation account, process documentation, and development records -- demonstrated by the paper itself, whose full documentation record is archived at a persistent identifier. The framework is a first iteration subject to revision, not a settled standard.
2025-11-10T08:56:21Z
21 pages. Transparency material documenting LLM usage available at: https://github.com/MicheleLoi/JPEP/tree/main/transparency/Canonical_MD
Michele Loi

http://arxiv.org/abs/2605.21517v1
The Ephemeral Web and the Case for Proactive Archiving
2026-05-18T04:10:32Z
The web is often treated as a durable record of institutional and social life, yet in practice it is fragile, revisable, and frequently ephemeral. Domains change, redesigns erase earlier material, institutions relocate, maintainers graduate, platforms impose silent limits, and periods of political instability can interrupt digital access entirely. This paper argues that archiving should not remain a niche activity practiced by a few specialists at the margins, but should become a proactive part of website maintenance. I motivate this claim through a case study centered on the Pakistan Embassy International School and College Tehran, whose domain, visual identity, leadership, and physical location all changed within a short period after my graduation. In response, I built and deployed a lightweight automated archival system using Python and GitHub Actions to submit pages and media from the site to the Internet Archive's Wayback Machine.
The project shows both that archival preservation can be automated with modest infrastructure and that archival systems are themselves vulnerable to interruption, as illustrated by GitHub's automatic disabling of scheduled workflows after repository inactivity. Drawing on personal experience with internet shutdowns in Iran, open-source sustainability lessons from RPI's RCOS, and the operational history of the archiver, I argue that the ephemerality of the web is not an exception but a structural condition. If digital societies wish to preserve institutional memory and public history without leaving preservation to chance, proactive archiving should become a commonplace part of website maintenance.
2026-05-18T04:10:32Z
9 pages, 1 algorithm
Meliksah Yorulmazlar

http://arxiv.org/abs/2605.17023v1
Sum of rank ratios: an alternative to percentiles for research assessment, from groundbreaking to mainstream research
2026-05-16T14:51:37Z
Assessing research that pushes the boundaries of knowledge is challenging because such work is extremely infrequent, accounting for only about 0.01 per cent of all research outputs. Consequently, knowledge about how to evaluate this type of research is far more limited than the well-established methods used to assess more common research outcomes. This study addresses this gap by using a rank-based approach in which each paper is assigned a unique value equal to the ratio between its local and global ranks. The cumulative value of these ratios, starting from the most cited paper, provides the evaluative basis, and the Rn index described here, using 10 rank ratios, appears to be the best option. Although research assessment based on global ranks was originally developed to evaluate the largest contributors to groundbreaking knowledge, namely, the USA and China, which account for most of the most cited papers, the Rn index has broader applications.
This study demonstrates that it is also a better option than the number of top 10 per cent or top 1 per cent highly cited papers, which are the most common indicators used to evaluate countries that seldom or never produce cutting-edge research that pushes the boundaries of knowledge. In all cases, the Rn index reflects the highest quality science produced by each country. Furthermore, the Rn index can be easily calculated without specialized training in bibliometrics and is only negligibly affected by ties in citation counts.
2026-05-16T14:51:37Z
Alonso Rodriguez-Navarro

http://arxiv.org/abs/2605.16562v1
Scaling Accessible Mathematics on arXiv: HTML Conversion and MathML 4
2026-05-15T19:04:45Z
We report on the ongoing development of arXiv's HTML Papers offering, available on every new TeX/LaTeX submission since its initial release in 2023.
The main highlights from 2025 and early 2026 are:
(i) community-driven improvements to HTML fidelity and service health, with roughly half of 6,000 user reports resolved;
(ii) corpus-scale conversion work aimed at 90% error-free HTML (currently 75%);
(iii) initial MathML 4 Intent annotations for accessible speech output;
(iv) an in-progress Rust port of LaTeXML, reducing compute costs and enabling faster previews on submission.
The arXiv HTML Papers project remains experimental, but is gradually maturing as we better understand the needs of arXiv's readers and the technical opportunities presented by new standards and by advances in programming languages and AI.
2026-05-15T19:04:45Z
6 pages, ICMS 2026
Deyan Ginev, Brian Caruso, Bruce Miller, Jeff Sank, Jacob Weiskoff

http://arxiv.org/abs/2605.16194v1
paper.json: A Coordination Convention for LLM-Agent-Actionable Papers
2026-05-15T17:10:50Z
LLM agents routinely serve as first (and sometimes only) readers of academic papers, skimming for sub-claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub-claims that cannot be cited at sub-paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than the paper itself. We propose `paper.json`, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does-not-claim list (C2), exact per-figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, hand-written JSON alongside the PDF, is achievable in under an hour for a finished paper without touching the human-readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: `uv run validator.py paper.json --against paper.typ` passes. Repo: https://github.com/arquicanedo/paper-json
2026-05-15T17:10:50Z
Arquimedes Canedo

http://arxiv.org/abs/2605.16475v1
Generative Artificial Intelligence for Literature Reviews
2026-05-15T15:42:54Z
Generative artificial intelligence (GenAI), based on large language models (LLMs), such as ChatGPT, has taken organizations, academia, and the public by storm.
In particular, impressive GenAI capabilities such as summarization of large text corpora, question-answering, data extraction, and translation carry profound implications for the conduct of literature reviews. This impacts science, organizations and the general public, as all can benefit from GenAI-supported literature reviews. Building on the technical foundations of GenAI and grounded in established methodological discourse, this work outlines approaches for conducting literature reviews using both general-purpose (e.g., ChatGPT, Gemini, Claude) and specialized GenAI tools (e.g., Consensus, Elicit). We provide illustrative examples of prompts and suggest methodologically sound literature review strategies. Throughout this perspective paper, we adopt a balanced approach considering both the opportunities and the risks of relying on GenAI in the conduct of literature reviews. We conclude by discussing philosophical questions related to the effects of GenAI on long-term scientific progress, and also present fruitful opportunities for research on improving the core of GenAI's technology (its architecture and training data) and suggest open issues in GenAI-based literature review methodology.
2026-05-15T15:42:54Z
Journal of Information Technology, 02683962261425675 (2026)
Gerit Wagner, Julian Prester, Reza Mousavi, Roman Lukyanenko, Guy Pare
10.1177/02683962261425675

http://arxiv.org/abs/2605.15362v1
Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering
2026-05-14T19:42:20Z
Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy.
We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]).
Three principal findings emerge. (1) The degree distribution follows a power law (alpha = 1.57 +/- 0.008), placing the Ukrainian court network near the EU Court of Justice and below the US Supreme Court, with hub articles cited by millions of decisions. (2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice. (3) Citation features predict top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655); temporal dynamics detect legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 -> 13.49) with emergent wartime legislation nodes.
The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.
2026-05-14T19:42:20Z
15 pages, 7 figures, 2 tables, 21 references
Volodymyr Ovcharov

http://arxiv.org/abs/2605.02651v2
ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review
2026-05-14T18:08:35Z
Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates its reconstructability using structural and content-based scores for reproducibility assessments. Experiments on 213 ReScience C articles, the largest cross-domain benchmark of human-validated computational reproducibility studies considered to date, demonstrate ARA's generalizability and consistent workflow reconstruction and assessment across LLMs, model temperatures, and scientific domains. ARA achieves ~61% accuracy on three benchmarks, and the highest accuracy reported on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%), highlighting its potential to complement human review at scale and enabling next-generation peer review. Code and data are available at: https://github.com/AndresLaverdeMarin/agentic_reproducibility_assessment.
2026-05-04T14:34:36Z
Kevin Riehl, Andres L. Marin, Nikofors Zacharof, Fan Wu, Patrick Langer, Robert Jakob, Anastasios Kouvelas, Georgios Fontaras, Michail A. Makridis

http://arxiv.org/abs/2605.15079v1
Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets
2026-05-14T17:04:39Z
Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains.
2026-05-14T17:04:39Z
23 pages, 5 figures, 11 tables. Project: https://lcp.mit.edu/croissant-baker/ Code: https://github.com/MIT-LCP/croissant-baker
Rafi Al Attrach, Rajna Fani, Sebastian Lobentanzer, Joan Giner-Miguelez, Debanshu Das, Varuni H. K., Nobin Sarwar, Rajat Ghosh, Anwai Archit, Surbhi Motghare, Christina Conrad Parry, Luis Oala, Lara Grosso, Joaquin Vanschoren, Steffen Vogler, Sujata Goswami, Eric S. Rosenthal, Marzyeh Ghassemi, Matthew McDermott, Tom Pollard

http://arxiv.org/abs/2605.14722v1
A Template-Driven Platform for Contextualised Researcher Profiles
2026-05-14T11:45:37Z
Modern researchers engage in diverse activities, assume multiple contribution roles, and produce a variety of outputs beyond traditional publications. This broader view of research contributions is increasingly recognised by responsible research assessment initiatives.
However, existing researcher profiling platforms remain largely focused on publications and publication-centric indicators, offering limited support for contextualised and multi-dimensional representations of research careers. This paper presents BIP! Scholar, a platform that supports flexible researcher profiling through a template-driven approach. Researchers can create profiles tailored to different presentation or assessment contexts using track-based, narrative-style, or hybrid templates which support the representation of diverse outputs, contribution roles, and broader research activities. The platform also supports research assessment experts who wish to design and evaluate experimental profile templates.
2026-05-14T11:45:37Z
Serafeim Chatzopoulos, Paris Koloveas, Kleanthis Vichos, Dionysis Diamantis, Thanasis Vergoulis

http://arxiv.org/abs/2602.15249v2
Artificial Intelligence Specialization in the European Union: Underexplored Role of the Periphery at NUTS-3 Level
2026-05-14T00:30:42Z
This study examines the distribution of Artificial Intelligence (AI) research across European NUTS-3 regions during the period 2015-2024. Using bibliometric data from Clarivate InCites and the Citation Topics classification system, we analyse two hierarchical thematic levels: Electrical Engineering, Electronics & Computer Science (Macro Citation Topic 4) and Artificial Intelligence & Machine Learning (Meso Citation Topic 4.61). Relative Specialization Index (RSI) and Relative Citation Impact (RCI) indicators are calculated for 781 European NUTS-3 regions. While major metropolitan hubs such as Paris, Warszawa, and Madrid dominate in absolute publication volume, the results reveal that the highest levels of relative AI specialization are concentrated in peripheral regions, particularly in Eastern Europe and Spain. Granada and Vilniaus apskritis stand out as regions combining high specialization with strong citation visibility.
The analysis further suggests a weak relationship between regional specialization and citation impact, revealing multiple regional profiles, including highly specialized regions with limited citation visibility, highly visible regions with comparatively low specialization, and diversified scientific systems combining moderate specialization with strong citation impact. Fyn emerges as an extreme case of very high citation impact despite relatively low specialization.
2026-02-16T23:01:14Z
15 pages, 3 figures
Victor Herrero-Solana, Carmen Gálvez

http://arxiv.org/abs/2605.14188v1
QOuLiPo: What a quantum computer sees when it reads a book
2026-05-13T23:10:15Z
What does a book look like to a quantum computer? This paper takes eight classical works of the Renaissance and its late-antique inheritance -- from Augustine to Galileo -- and runs each through a neutral-atom quantum processor. The bridge is graphs: each textual unit becomes an atom, and graph edges are physical blockade constraints for engineered exact unit-disk designs, or a 2D approximation to the semantic graph for natural texts.
Three contributions follow. First, we introduce rigidity rho, a metric for how unique a book's structural backbone is -- distinguishing Marguerite de Navarre's Heptameron (rigid, twelve-nouvelle hard core) from Boethius (fully fungible, every chapter substitutable). Second, we invert the pipeline: rather than extracting a graph from existing prose, we pick a target graph the hardware encodes natively, and write a book whose structure matches it. The twenty-nine texts written this way, collected under the name QOuLiPo, extend the OuLiPo tradition to graph-topological constraints and, together with the eight natural texts, form a benchmark distribution against which neutral-atom hardware can be tracked as it scales. Third, we run both natural and engineered texts on Pasqal's FRESNEL processor up to one hundred atoms; engineered texts reach high approximation ratios, the cleanest instances returning the exact backbone.
A cloud-accessible quantum machine plus an agentic coding environment now lets a single investigator run this pipeline end-to-end. What is reported is an application layer, not a speedup -- humanistic instances ready to load onto neutral-atom processors as they scale, already complementing classical text analysis. The Digital Humanities community has a stake in building familiarity with this hardware now: the engineered-corpus design choices made today fix the benchmark distribution future hardware will be measured against.
2026-05-13T23:10:15Z
Christophe Jurczak

http://arxiv.org/abs/2605.13310v1
SemRepo: A Knowledge Graph for Research Software and Its Scholarly Ecosystem
2026-05-13T10:25:43Z
We present SemRepo, an RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories associated with scientific research. SemRepo captures repository-level metadata, such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in LPWC, and research artifacts, such as datasets and experiments, are linked via MLSea-KG. This integration enables queries that span publications and their scholarly artifacts, which are typically fragmented across separate platforms. SemRepo supports analyses that are difficult to perform with existing resources in isolation, including provenance reconstruction across repositories and publications, as well as the systematic identification of risks to research reproducibility and software sustainability. By unifying research software with its scholarly context in a single graph, SemRepo provides an important infrastructure for large-scale analysis of software within the broader scientific research ecosystem.
2026-05-13T10:25:43Z
Abdul Rafay, Yuni Susanti, David Lamprecht, Michael Färber