https://arxiv.org/api/HueOAWgaa331Ns8q5zwXffSmcUw 2026-06-09T21:33:04Z 6057 15 15 http://arxiv.org/abs/2606.05443v1 MIRAI: Prediction and Generation of High-Impact Academic Research 2026-06-03T21:06:01Z

The rapid pace of scientific publishing has made the identification and synthesis of high-impact work an increasingly urgent challenge. We introduce MIRAI (Multi-year Inference of Research trends and Academic Impact), a deep learning framework that predicts paper impact using only its title, abstract, and publication date. We train MIRAI on the arXiv academic graph to predict 5-year PageRank and citation counts, achieving Spearman's $\rho$ of 0.4686 on PageRank prediction and 0.6192 on citation prediction for papers published in 2021. We propose a research ideation pipeline built on top of MIRAI that produces research ideas oriented towards high impact. These ideas were judged as more impactful than a baseline without MIRAI by an unbiased LLM judge at a 4:3 ratio. We make the 5-year citation prediction model publicly available at https://predict-paper-impact.vercel.app.
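For readers unfamiliar with the evaluation metric, the following is a minimal sketch of Spearman's rank correlation: both series are converted to ranks (ties share an average rank) and the Pearson correlation of the ranks is returned. The function names and the four toy scores are invented for illustration and are not taken from MIRAI.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical predicted impact scores and observed citation counts.
predicted = [0.10, 0.40, 0.35, 0.80]
observed = [3, 10, 12, 50]
rho = spearman_rho(predicted, observed)
```

Because only the ordering matters, a model can score well on this metric even when its raw predictions are on a different scale from the observed counts.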

2026-06-03T21:06:01Z Alex Li Joseph Jacobson http://arxiv.org/abs/2606.04852v1 A Note on the Kullback-Leibler Divergence in Discretized Empirical Distributions 2026-06-03T13:20:51Z

When empirical objects are represented as discrete probability distributions, within-distribution summaries such as Shannon entropy and Hill-type diversity indices describe how probability mass is spread inside each object, while Kullback-Leibler (KL) divergence provides pairwise asymmetric information. This note focuses on the KL difference $\Delta_{\mathrm{KL}}(p,q)=D_{\mathrm{KL}}(p\,\|\,q)-D_{\mathrm{KL}}(q\,\|\,p)$. Although $\Delta_{\mathrm{KL}}$ can add information beyond within-distribution summaries and symmetric overlap, its sign does not, by itself, establish support inclusion, coverage, or breadth. It is better understood as a weighted category-wise log-ratio contrast reflecting asymmetric probability-mass placement. The point becomes clear once the definition is written out. The aim of this note is therefore to present it in a compact, example-based form, together with a descriptive bibliometric illustration based on COVID-19-related preprint-server topic distributions.
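Writing the definition out gives, for distributions with full common support, D_KL(p||q) - D_KL(q||p) = sum_i p_i log(p_i/q_i) - sum_i q_i log(q_i/p_i) = sum_i (p_i + q_i) log(p_i/q_i). A minimal Python sketch checks this identity numerically; the two three-category distributions are invented for illustration and do not come from the note.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_difference(p, q):
    """Delta_KL(p, q) = D_KL(p || q) - D_KL(q || p)."""
    return kl(p, q) - kl(q, p)

def log_ratio_contrast(p, q):
    """Same quantity as sum_i (p_i + q_i) * log(p_i / q_i).

    Requires every p_i and q_i to be strictly positive.
    """
    return sum((pi + qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Illustrative distributions over three categories (assumed values).
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

delta = kl_difference(p, q)
contrast = log_ratio_contrast(p, q)
```

Both distributions here occupy the same three categories, so whatever sign `delta` takes reflects only where each one places its mass, not which one covers more categories.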

2026-06-03T13:20:51Z Hayami Osaki http://arxiv.org/abs/2606.07661v1 PereStruct: Multimodal Semantic Assembly for Robust Historical Document Parsing 2026-06-03T13:10:47Z

Parsing historical documents with complex, non-standard layouts remains a fundamental bottleneck in large-scale archival digitization. Unlike modern typography, historical newspapers exhibit severe physical degradation and highly irregular page structures that confound even state-of-the-art vision-language models, presenting severe out-of-distribution challenges. We address this gap with an automated pipeline specifically designed for parsing historical newspapers, documents characterized by particularly intricate multi-column layouts. Our approach combines a fine-tuned YOLO architecture for layout analysis and block detection, trained on 1,426 fully human-annotated scanned pages, with a novel semantic assembly module that reconstructs articles by jointly modeling lexical-semantic similarity via TF-IDF, visual embeddings from our fine-tuned YOLO, and geometric layout constraints. This multi-modal integration yields state-of-the-art performance, achieving an F1 score of 0.904 on block-to-article mapping. Notably, end-to-end evaluation against vision-language models (Qwen3.6-35B-A3B and Qwen3.6-Plus) demonstrates that PereStruct achieves substantially higher fidelity (BLEU approximately 0.96 vs 0.34), validating that modular architectures excel where generic VLMs fail on complex historical layouts. To support reproducibility and advance research in this domain, we release both the training corpus of 599 annotated pages and a curated PereStruct benchmark of 93 pages with expert-verified ground-truth block-to-article mappings. This framework establishes a robust foundation for high-fidelity digitization and semantic reconstruction of complex archival materials.

2026-06-03T13:10:47Z Code and data available at https://github.com/makSShandybo/PereStruct Maksim Shandybo Ivan Bespalov Daniil Yefimov Marina Kosheleva Alexander Loukianov http://arxiv.org/abs/2606.04382v1 LCSHBench: A Multilingual, Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment 2026-06-03T02:58:11Z

Automated subject cataloging assigns controlled-vocabulary headings to bibliographic records, but LCSH has no standard public benchmark. We introduce LCSHBench: 22,346 books in 15 languages from the openly licensed Harvard, Columbia, and Princeton catalogs. Records enter only when at least two independent cataloging agencies assigned LCSH; we release per-catalog provenance plus union and unanimous answer views. A concordance study of 465,187 works cataloged by all three libraries shows why this design matters: libraries usually agree on the underlying topic (93.3% share a concept-level heading) but often differ in exact expression (39.4% have identical heading sets). LCSHBench therefore scores both exact and concept matches, with set and rank metrics broken down by language and heading type, across open-vocabulary generation and full-vocabulary retrieval. As a first demonstration, a low-rank fine-tune of a 300M on-device embedder improves cross-lingual retrieval and beats a 3,072-dimensional hosted embedder on development exact recall@200 (0.659 vs 0.623). The language panel shows the gain is not uniform, and held-out-test and end-to-end confirmation remain future work.
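The difference between exact and concept matching can be illustrated with a small recall@k sketch. The abstract does not define how a heading is reduced to a concept, so the rule used here (keep the part before the first "--" subdivision) is purely an assumption, as are the function names and the sample headings.

```python
def recall_at_k(ranked, gold, k):
    """Fraction of gold items found among the top-k ranked items."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & set(gold)) / len(set(gold))

def to_concept(heading):
    """Assumed concept rule: drop subdivisions after the first '--'."""
    return heading.split("--", 1)[0].strip()

def concept_recall_at_k(ranked, gold, k):
    """recall@k after mapping both sides to their assumed concept."""
    return recall_at_k(
        [to_concept(h) for h in ranked],
        [to_concept(h) for h in gold],
        k,
    )

# Hypothetical gold headings and a hypothetical ranked system output.
gold = ["Quantum computers", "Cryptography--Data processing"]
ranked = ["Cryptography", "Quantum computers", "Computers--History"]

exact = recall_at_k(ranked, gold, k=2)
concept = concept_recall_at_k(ranked, gold, k=2)
```

In this toy case the system finds the right topic for both gold headings but matches only one of them string-for-string, which is the kind of gap the two scoring views are meant to separate.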

2026-06-03T02:58:11Z Kwok Leong Tang http://arxiv.org/abs/2606.03919v1 Forecasting Conceptual Diffusion in Science: The Case of Quantum Computing 2026-06-02T17:12:02Z

Understanding and anticipating scientific change requires models that distinguish between endogenous consolidation and exogenous diffusion of scientific concepts. Using the quantum computing subtree of concepts in OpenAlex, we construct a temporally resolved concept co-occurrence network and track each concept pair through its upstream citation lineage and downstream diffusion. We train LightGBM models on distributional and diversity-aware features to predict four outcomes: endogenous reinforcement, exogenous diffusion, their ratio, and diffusion entropy. After controlling for overall publication growth of the scientific body, endogenous reinforcement proves largely unpredictable in the primary quantum-computing benchmark. In contrast, exogenous diffusion and entropy are strongly predictable ($R^2$ up to $0.78$) and are driven by upstream heterogeneity, citation breadth, and distributional dispersion, as shown by SHAP analyses; replications on robotics, advanced materials, and neuro implants confirm that exogenous diffusion remains the top-ranked target across fields ($R^2_{\mathrm{test}} \sim 0.60\text{--}0.87$), while endogenous predictability rises markedly in neuro implants ($R^2_{\mathrm{test}} = 0.83$), indicating that the quantum-computing asymmetry does not generalise uniformly. Case studies reveal that sharp entropy increases coincide with the opening of new conceptual frontiers, while entropy collapses signal technological convergence or paradigm displacement. These results demonstrate that conceptual diffusion is governed by stable structural regularities embedded in semantic and citation environments. By identifying early diversity-based signals of cross-domain uptake, the approach provides a scalable foundation for anticipatory scientometrics, technology foresight, and innovation-oriented policy analysis in rapidly evolving research fields.

2026-06-02T17:12:02Z 19 pages, 5 figures, 6 tables. Code and manuscript sources: https://github.com/wazaahhh/breakthroughs-diffusion . An earlier version was presented at the Global Tech Mining Conference (GTM) 2026 (submission #117) Thomas Maillart Thibaut Chataing David Dosu Paul Bagourd Julian Jang-Jaccard Alain Mermoud http://arxiv.org/abs/2606.03864v1 Explainable Forecasting of Scientific Breakthroughs from Concept Network Dynamics 2026-06-02T16:38:41Z

We introduce an explainable machine-learning approach that forecasts the structural precursors of scientific breakthroughs -- the emergence and intensification of links between research concepts -- by modelling how OpenAlex concept networks evolve over time. Using 59 semantic and topological features, a two-stage LightGBM model jointly predicts the formation and the future weight of concept pairs, adding a regression stage that quantifies expected intensity to prior link-existence forecasts. Relative to the state of the art, the approach improves accuracy and explainability at once: comparative validation across four technology and biomedical domains yields ROC-AUC in [0.954, 0.967] at all horizons without re-tuning, exceeding the roughly 0.90 of prior models, while every forecast rests on structural, auditable features rather than opaque embeddings. Classification performance is high (AUC about 0.95) and regression remains stable (RMSLE 0.45 to 0.6 over one to five years). Feature attribution shows that structural factors -- particularly Adamic-Adar similarity and degree-based Hadamard measures -- consistently drive accuracy, suggesting that breakthrough-relevant recombinations emerge in tightly connected sub-networks. Two expert-anchored cases, quantum annealing and AI-enabled quantum architectures, show the model surfacing technological convergence consistent with expert expectations. We then outline a three-layer decision architecture -- detection, expert translation, institutional integration -- that turns these forecasts into evidence-based research strategy and policy, anchored in open data and explainable features.
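As background on one of the structural features named above, the Adamic-Adar index of a node pair sums 1/log(degree) over their common neighbours, so rarely connected shared neighbours count for more. The sketch below is a generic illustration on an invented five-edge concept graph; it is not the authors' feature pipeline, and the helper names are hypothetical.

```python
import math
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree(z)) over common neighbours z of u and v."""
    common = adj[u] & adj[v]
    # A common neighbour always has degree >= 2; the guard is defensive.
    return sum(1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1)

# Toy concept co-occurrence edges (invented for illustration).
edges = [
    ("quantum annealing", "optimization"),
    ("quantum annealing", "qubit"),
    ("machine learning", "optimization"),
    ("machine learning", "qubit"),
    ("qubit", "superconductor"),
]
adj = build_adjacency(edges)

score = adamic_adar(adj, "quantum annealing", "machine learning")
no_overlap = adamic_adar(adj, "superconductor", "optimization")
```

Here the pair shares two neighbours, "optimization" (degree 2) and "qubit" (degree 3), so the lower-degree neighbour contributes the larger term.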

2026-06-02T16:38:41Z 18 pages, 10 figures, 4 tables. An earlier version was presented at Global Tech Mining Conference 2026. Code and data: https://github.com/wazaahhh/breakthroughs-forecasting Thomas Maillart Thibaut Chataing Ntorina Antoni David Dosu Paul Bagourd Julian Jang-Jaccard Alain Mermoud http://arxiv.org/abs/2603.26791v3 Crystal: Characterizing Relative Impact of Scholarly Publications 2026-06-02T15:20:04Z

Assessing a cited paper's impact is typically done by analyzing its citation context in isolation within the citing paper. While this focuses on the most directly relevant text, it prevents relative comparisons across all the works a paper cites. We propose Crystal, which instead jointly ranks all cited papers within a citing paper using large language models (LLMs). To mitigate LLMs' positional bias, we rank each list three times in a randomized order and aggregate the impact labels through majority voting. This joint approach leverages the full citation context, rather than evaluating citations independently, to more reliably distinguish impactful references. Crystal outperforms a prior state-of-the-art impact classifier by +9.5% accuracy and +8.3% F1 on a dataset of human-annotated citations. Crystal further gains efficiency through fewer LLM calls and outperforms prior baselines using an open-weight model, enabling scalable, cost-effective citation impact analysis. In a case study of ACL Test-of-Time award-winning papers, we find that Crystal's impact characterizations align closely with long-term scientific recognition. We release Crystal-Bank, a 46.8k-paper dataset with rankings and impact labels, along with code.
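The shuffle-and-vote step described above can be sketched as follows. The `rank_fn` argument stands in for the LLM call and is hypothetical, as are the paper identifiers and the toy ranker, which is deliberately biased toward whatever item appears first; none of this reflects Crystal's actual prompts or code.

```python
import random
from collections import Counter

def label_with_shuffles(cited_ids, rank_fn, rounds=3, seed=0):
    """Label each cited paper by majority vote over shuffled presentations.

    rank_fn takes an ordered list of ids and returns {id: label}.
    """
    rng = random.Random(seed)
    votes = {cid: [] for cid in cited_ids}
    for _ in range(rounds):
        order = list(cited_ids)
        rng.shuffle(order)  # a fresh random presentation order each round
        for cid, label in rank_fn(order).items():
            votes[cid].append(label)
    # On ties, Counter.most_common keeps the label that was seen first.
    return {cid: Counter(v).most_common(1)[0][0] for cid, v in votes.items()}

# Toy stand-in for an LLM ranker with positional bias: it always marks the
# first item "high", in addition to the items that really are high impact.
TRULY_HIGH = {"p1"}

def biased_ranker(order):
    return {
        cid: "high" if cid in TRULY_HIGH or i == 0 else "low"
        for i, cid in enumerate(order)
    }

cited = ["p1", "p2", "p3", "p4"]
labels = label_with_shuffles(cited, biased_ranker)
```

With an odd number of rounds and two labels a strict majority always exists; an item that is mislabelled only when it happens to come first needs to land in that position in most rounds before the error survives the vote.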

2026-03-25T16:42:30Z Hannah Collison Benjamin Van Durme Daniel Khashabi http://arxiv.org/abs/2606.03362v1 Emerging and established topics in drone research: Citation impact and knowledge flows across China, the United States, the EU, Ukraine, and Russia (2020-2025) 2026-06-02T09:12:11Z

This study examined emerging and established topics in drone research, focusing on citation impact and knowledge flows across China, the United States, the EU, Ukraine, and Russia between 2020 and 2025 using OpenAlex bibliographic data. The findings revealed that drone-related science is characterised by growing geopolitical asymmetries in scientific production, citation concentration, and international knowledge exchange. In particular, China increasingly dominated scientific production, fractional authorship contribution, and domestic citation circulation. In contrast, the United States and EU countries maintained comparatively more internationally distributed citation structures. However, China-affiliated publications became increasingly integrated into global citation networks, particularly through growing citation exchange with the United States and European countries. Notably, the interpretation of authorship and citation patterns was complicated by the high proportion of publications with unidentified affiliations, which reached 50% in 2025 within weak-signal topics. These findings underscore the importance of developing comprehensive national Research Organisation Registries (RORs). Although China demonstrated a citation advantage, this was partly driven by high internal domestic citation concentration rather than exclusively by global integration. Moreover, China still imported proportionally more knowledge from the EU-14 and the United States than it exported, with this asymmetry increasing over time. EU-14 countries maintained the strongest citation impact in weak-signal topics, suggesting a more prominent role in shaping emerging research directions. At the same time, China-affiliated publications cited the United States more frequently than the EU-14 in both strong- and weak-signal topics, with this pattern being particularly pronounced in weak-signal areas.

2026-06-02T09:12:11Z Myroslava Hladchenko http://arxiv.org/abs/2606.03117v1 Excessive use, ill use and misuse of Bibliometrics 2026-06-02T04:00:47Z

Impact factor, H-index, citation index, and other such indices have been playing an increasing role in the scientific assessment of institutions, researchers, the allocation of research funds, and so on across the globe. These indices do not have any statistical basis, yet many decisions, such as the ranking of institutions and departments, the assessment of faculty members for hiring and promotion, and selection for various awards, are being made using them. Several experts across disciplines have written that these indices should have a marginal role, if any, and that judgements should be based on critical assessment of the content by experts. But the dependence on these indices is steadily increasing. This article cites various such documents published across the globe and across disciplines, and urges decision makers to ignore such indices or at best give them minimal weight.

2026-06-02T04:00:47Z This article is meant for all sciences, especially decision makers Rajeeva Laxman Karandikar http://arxiv.org/abs/2606.02905v1 Speaker Mining -- FAIR Data on Public Broadcasts for Question Answering 2026-06-01T21:20:51Z

Public broadcasts are at the center of civic discourse: Traditional television talk shows, alongside emerging podcast and web video formats, capture and guide the attention of our societies, shaping how citizens encounter politics, science, and societal issues. Yet, systematic or even simple analyses of these formats face similar challenges: guest and content metadata are scarce, fleeting, fragmented, and not standardized. Research conducted and questions answered are based on extensive, laborious, yet isolated data-curation efforts that capture only a fraction of the relevant landscape. This work seeks to address this issue using a scaling-oriented framework for FAIR data curation in public broadcasting. Evaluated on 15 broadcasting programs, the pipeline aggregates ZDF Archive PDFs, fernsehserien.de, and Wikidata into a unified knowledge graph. Of the 31,817 candidate guest mentions from these three sources, 17,729 could be automatically disambiguated, and a further 5,958 were resolved through 64 hours of manual reconciliation using OpenRefine. Results are published at speakermining.wikibase.cloud and linked to Wikidata, enabling SPARQL-based question answering based on gender, age, occupation, or institutional affiliation across 8,436 canonical persons with 23,527 appearances in 6,469 aligned episodes. Our iterative experience reveals that correctly disambiguating and deduplicating speaker data from heterogeneous sources demands dedicated effort on sustainable infrastructure. For scalable and reliable question answering on public broadcasts to be accessible to everyone, we recommend fostering the potential of linked open data: Advancing alignment and utilization approaches like this work, particularly towards crowdsourced development and curation, but also more FAIR data interfaces from public broadcast service providers.

2026-06-01T21:20:51Z 17 pages, 5 figures, submitted to TPDL 2026 Tim Wittenborg Omar Imad Remmo Claudia Frick Lena John Oliver Karras Sören Auer http://arxiv.org/abs/2606.02184v1 The Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing 2026-06-01T12:38:56Z

These names do not exist. Elena Vasquez and Marcus Chen have appeared as volcano experts, astronauts, thriller protagonists, podcast hosts, and academic co-authors across hundreds of independently produced AI-generated documents, never having lived. We show that large language models do not merely default to high-probability individual names when generating fictional experts: they produce correlated character ensembles, pairs and trios whose co-occurrence rates far exceed chance and are consistent across independent generations. These priors are model-family-specific (Claude: Elena Vasquez + Marcus Chen + Amara Okafor; Gemini: Aris Thorne + Lena Petrova; GPT: Elara Voss with no fixed partner), version-specific, and actively suppressed at model release boundaries, leaving dateable behavioral fingerprints in the content they produced. We document a downstream consequence at scale. On Zenodo, a CERN-operated repository that mints real DataCite DOIs, we identify 1,655 ghost-authored records claiming nonexistent journals with fabricated publication dates: server-side DataCite timestamps prove deliberate backdating, and 991 records were registered in a single month; these carry real DOIs registered in DataCite, making them harvestable by any scholarly aggregator that ingests DOI metadata. Ghost names additionally appear on ResearchGate forming synthetic research groups with collaborators drawn from multiple model families; publication dates on these records provide a reliable temporal proxy for model deployment windows.

2026-06-01T12:38:56Z Michał Brzozowski Neo Christopher Chung http://arxiv.org/abs/2411.05196v3 Explainable AI Through a Democratic Lens: DhondtXAI for D'Hondt-Projected Feature Attribution 2026-05-31T19:32:10Z

This study presents DhondtXAI as a SHAP-independent, D'Hondt-based attribution framework for tabular XAI. Instead of model-native feature importance or SHAP values, DhondtXAI computes background-interventional removal effects, separates positive and negative evidence, forms optional feature alliances, applies optional thresholds, allocates seats via the D'Hondt rule, and projects onto the local model-output difference. Completeness is preserved by construction, with the projection residual ratio reported as a diagnostic. The method is evaluated on synthetic additive and interaction tests, correlated-feature perturbations, operator and apportionment ablations, projection-mode comparisons, logit-scale checks, repeated split validation, paired deletion tests, and two healthcare datasets: Wisconsin Diagnostic Breast Cancer (CatBoost) and early-stage diabetes risk prediction (XGBoost). SHAP serves only as an external comparator with aligned settings. In additive synthetics, DhondtXAI exactly recovers ground-truth rankings; in multiplicative interactions, alliances reduce the mean projection residual from 0.2527 to 0.0001. On WDBC and diabetes data, it shows high agreement with SHAP (Spearman rho = 0.9273 and 0.9353), supported by further signed, top-k, magnitude, deletion, and sensitivity analyses. Results position DhondtXAI as a complementary proportional, alliance-aware, and threshold-aware tabular XAI method, not a replacement for SHAP or LIME.
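The seat-allocation step rests on the standard D'Hondt highest-averages rule: each seat goes to the candidate with the largest quotient score / (seats already won + 1). The sketch below shows only that rule, with features playing the role of parties; the feature names, scores, and seat count are invented, and the other stages listed above (interventional removal effects, alliances, thresholds, projection) are not modelled.

```python
def dhondt(scores, seats):
    """Allocate `seats` among the keys of `scores` by the D'Hondt rule.

    Ties go to the key that appears first in `scores`.
    """
    allocation = {name: 0 for name in scores}
    for _ in range(seats):
        # Highest current quotient wins the next seat.
        winner = max(scores, key=lambda n: scores[n] / (allocation[n] + 1))
        allocation[winner] += 1
    return allocation

# Hypothetical non-negative attribution magnitudes for four features.
scores = {"radius": 100.0, "texture": 80.0, "area": 30.0, "symmetry": 20.0}
allocation = dhondt(scores, seats=8)
# allocation == {"radius": 4, "texture": 3, "area": 1, "symmetry": 0}
```

The rule favours larger scores slightly more than strict proportionality would: "symmetry" holds about 9% of the total score but receives no seat when only eight are available.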

2024-11-07T21:43:29Z Turker Berk Donmez http://arxiv.org/abs/2606.01127v1 How Proposal Novelty, Topical Diversity, and Theory-Practice Balance Shape Scholarly Outcomes in Funded Education Research 2026-05-31T10:01:21Z

Education research occupies a distinctive position in public science because it is expected to advance scholarly knowledge while also informing learning, teaching, participation, and workforce development. This study examines how the intellectual characteristics of NSF-funded education proposals are associated with the subsequent academic performance of funded scholars. Linking 8,715 NSF education awards from 1990 to 2020 with 84,519 publications by principal investigators, the analysis focuses on four major NSF education divisions that collectively span undergraduate and graduate levels, formal and informal learning environments, and inclusive educational initiatives. Proposal novelty is measured as semantic distance from prior funded projects within the same division, topical diversity as breadth across latent research themes, and intellectual orientation as theoretical, practical, or balanced. The results show that NSF education funding is consistently associated with higher publication output across divisions. However, this increase is not accompanied by stronger citation performance or higher journal-level visibility; citation and CiteScore estimates are often negative, particularly in later decades. Proposal novelty shows limited and uneven associations with post-award outcomes, whereas topical diversity is more clearly related to publication growth in some divisions but weaker citation-based performance in others. Balanced proposals that integrate theoretical and practical aims display the most favourable overall profile, combining positive publication associations with fewer negative citation-based patterns. These findings highlight the importance of evaluating education research funding through multiple academic outcomes and division-specific research contexts.

2026-05-31T10:01:21Z Yunfeng Gao Yuxuan Xiao Jiaming Zhang Yang Ding http://arxiv.org/abs/2606.01124v1 Frontlines and faultlines: How the Russo-Ukrainian conflict reshapes the landscape of scientific research 2026-05-31T09:50:20Z

Geopolitical conflict poses significant challenges to research and innovation policy by disrupting scientific systems and talent mobility. This study analyzes the impact of the conflict between Russia and Ukraine, particularly the escalations in 2014 and 2022, on the academic landscapes of both countries. We analyzed publication data from 2000 to 2023, encompassing over 1.8 million papers, one million scholars, and 2300 institutions across Ukraine and Russia, alongside collaboration data spanning 193 regions. We tracked scholar migration, research topics, and evolving international networks. Significant migration followed the 2014 and 2022 events, causing severe talent loss and a sharp decline in domestic research visibility in Ukraine. Migrated Ukrainian scholars shifted toward internationalized basic sciences, whereas active scholars who remained focused on applied fields relevant to national resilience and reconstruction. Both groups experienced decreased output in resource-dependent fields, particularly medical research. Global networks fractured: traditional ties between Russia and the West, as well as between Ukraine and Russia, dissolved. These were replaced by new alignments between Russia and neighboring countries, and between Ukraine and the West. Migrating Ukrainian scholars face challenges assuming key research roles, though academic communities in smaller host nations showed a trend toward leadership positions. Concurrently, Russian scholars saw a decline in research prominence across most countries due to international sanctions. These findings reveal how conflict disrupts national scientific capacity, fractures global research networks, and affects individual academic careers, highlighting the need for targeted policies to support vulnerable academic communities during crises.

2026-05-31T09:50:20Z Yang Ding http://arxiv.org/abs/2606.01109v1 Digging Up Citations: FOSSIL, a Dataset and Workflow for Reference Extraction in Law and the Humanities 2026-05-31T08:59:49Z

Citation extraction tools are designed for the structured end-of-document bibliographies of the natural sciences, but law and humanities scholarship cites references primarily in footnotes, where bibliographic data is interleaved with commentary and cross-references and varies widely across languages and styles. To address the scarcity of suitable gold-standard resources, we present FOSSIL (Footnote-based Open-access SSH Scientific Instance Labels), an openly licensed multilingual dataset of 96 annotated scholarly articles containing over 7,600 footnote-embedded references, together with PDF-TEI Editor (a collaborative web annotation tool), a documented seven-annotator workflow, and a Grobid specialization for footnote-based citations. In end-to-end evaluation, the specialized pipeline nearly doubles extraction quality over default Grobid (micro-F1 from 0.36 to 0.72), driven largely by improved recall, while showing that substantial headroom remains for cross-references and mixed-content footnotes. This extended abstract presents work in progress; annotation of citation segmentation and parsing, as well as cross-reference resolution, is ongoing.
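Micro-F1 pools true positives, false positives, and false negatives over all documents before computing precision and recall, so documents with many references weigh more than short ones. The sketch below assumes exact string matching between predicted and gold references, which is a simplification; the matching criterion used in the FOSSIL evaluation is not stated here, and the two toy documents are invented.

```python
def micro_f1(per_document):
    """Micro-averaged F1 over (predicted_set, gold_set) pairs."""
    tp = sum(len(pred & gold) for pred, gold in per_document)
    fp = sum(len(pred - gold) for pred, gold in per_document)
    fn = sum(len(gold - pred) for pred, gold in per_document)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two hypothetical articles: extracted references vs. gold references.
documents = [
    ({"ref_a", "ref_b", "ref_c"}, {"ref_a", "ref_b", "ref_d", "ref_e"}),
    ({"ref_x"}, {"ref_x", "ref_y", "ref_z"}),
]
score = micro_f1(documents)
```

In this toy case precision is 3/4 and recall is 3/7, so the score is held down mainly by missed references, which is the same direction of error that a recall-driven improvement would address.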

2026-05-31T08:59:49Z This is an extended abstract, peer-reviewed and presented at CiteX2026 https://sites.google.com/view/workshop-on-citation-extractio/startseite Luca Foppiano Christian Boulanger