http://arxiv.org/abs/2604.04562v1 Paper Espresso: From Paper Overload to Research Insight 2026-04-06T09:45:21Z

The accelerating pace of scientific publishing makes it increasingly difficult for researchers to stay current. We present Paper Espresso, an open-source platform that automatically discovers, summarizes, and analyzes trending arXiv papers. The system uses large language models (LLMs) to generate structured summaries with topical labels and keywords, and provides multi-granularity trend analysis at daily, weekly, and monthly scales through LLM-driven topic consolidation. Over 35 months of continuous deployment, Paper Espresso has processed over 13,300 papers and publicly released all structured metadata, revealing rich dynamics in the AI research landscape: a mid-2025 surge in reinforcement learning for LLM reasoning, non-saturating topic emergence (6,673 unique topics), and a positive correlation between topic novelty and community engagement (2.0x median upvotes for the most novel papers). A live demo is available at https://huggingface.co/spaces/Elfsong/Paper_Espresso.

2026-04-06T09:45:21Z Mingzhe Du Luu Anh Tuan Dong Huang See-kiong Ng http://arxiv.org/abs/2603.15416v2 Estimating Absolute Web Crawl Coverage From Longitudinal Set Intersections 2026-04-05T23:17:01Z

Web archives preserve portions of the web, but quantifying their completeness remains challenging. Prior approaches have estimated the coverage of a crawl by either comparing the outcomes of multiple crawlers, or by comparing the results of a single crawl to external ground truth datasets. We propose a method to estimate the absolute coverage of a crawl using only the archive's own longitudinal data, i.e., the data collected by multiple subsequent crawls. Our key insight is that coverage can be estimated from the empirical URL overlaps between subsequent crawls, which are in turn well described by a simple urn process. The parameters of the urn model can then be inferred from longitudinal crawl data using linear regression. Applied to our focused crawl configuration of the German Academic Web, with 15 semi-annual crawls between 2013 and 2021, we find a coverage of approximately 46 percent of the crawlable URL space for the stable crawl configuration regime. Our method is extremely simple, requires no external ground truth, and generalizes to any longitudinal focused crawl.
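
To illustrate how overlaps between crawls can yield a coverage estimate, here is a minimal sketch. It is not the paper's urn model, whose details the abstract does not give. It assumes each crawl draws its URLs independently and uniformly from a fixed space of N URLs, so the expected overlap of two crawls of sizes a and b is a*b/N. The function names and the toy numbers are hypothetical.

```python
from typing import Sequence, Tuple


def estimate_space_size(pairs: Sequence[Tuple[int, int, int]]) -> float:
    """Estimate the URL space size N from (size_a, size_b, overlap) triples.

    ASSUMPTION (not from the paper): crawls are independent uniform draws,
    so overlap ~ (1/N) * size_a * size_b. A least-squares line through the
    origin has slope sum(x*y) / sum(x*x); N is the reciprocal of that slope.
    """
    sum_xy = sum(a * b * overlap for a, b, overlap in pairs)
    sum_xx = sum((a * b) ** 2 for a, b, _ in pairs)
    if sum_xy == 0:
        raise ValueError("no overlap observed; N cannot be estimated")
    return sum_xx / sum_xy


def coverage(crawl_size: int, space_size: float) -> float:
    """Fraction of the estimated URL space reached by one crawl."""
    return crawl_size / space_size


# Toy triples constructed to be consistent with N = 1000 (illustrative only).
toy_pairs = [(400, 500, 200), (500, 600, 300), (200, 500, 100)]
n_hat = estimate_space_size(toy_pairs)
print(n_hat, coverage(460, n_hat))
```

With a single pair of crawls, this reduces to the classic capture-recapture estimate N = a*b/overlap. Using several pairs in one regression averages out noise in individual overlaps.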

2026-03-16T15:28:30Z Michael Paris Grigori Paris Fabian Baumann http://arxiv.org/abs/2604.03776v1 Bridging the Language Gap in Scholarly Data I: Enhancing Author Disambiguation Algorithms for Chinese Names 2026-04-04T15:55:20Z

Disambiguating scholars with identical names is essential for accurate authorship assignment and robust large-scale scientometric research. Existing methods are often designed for Latin-script metadata and perform poorly on Chinese names. In international publications, Chinese names typically appear as Romanized Pinyin, which is highly ambiguous as it can map to multiple distinct characters. Chinese characters, in contrast, reduce but do not eliminate this ambiguity, and are rarely available in international records. To address both challenges, we propose a rule-based disambiguation framework that integrates co-authorship networks, citation networks, author affiliations, and content similarity. We apply this framework to 65,241 physics papers from the China National Knowledge Infrastructure (CNKI), spanning over 70 years of data. On a human-annotated sample of 80 name pairs, our method achieves F1-scores of 0.88 for Pinyin names and 0.89 for character-based names, outperforming two baseline approaches, with improvements driven primarily by higher recall. The comparable performance across both writing systems shows that our approach is script-agnostic, enabling reliable large-scale scientometric analyses.

2026-04-04T15:55:20Z Mingrong She Liuhuaying Yang Ana Maria Jaramillo Lisette Espín-Noboa http://arxiv.org/abs/2604.06236v1 LLMs Have Made Failure Worth Publishing 2026-04-04T13:57:49Z

Scientific publishing systematically filters out negative results. We argue that this long-standing asymmetry has become an urgent problem in the era of large language models, which inherit the positive bias of the literature they are trained on, face an impending shortage of high-quality training data, and are increasingly deployed as both research tools and peer reviewers. We analyze three ways in which LLMs have changed the value of failure data and show that the systematic absence of such data degrades their utility as research tools, training data consumers, and peer reviewers alike. We outline experimental protocols to validate these claims and discuss the structural conditions under which a failure-inclusive publishing culture could emerge.

2026-04-04T13:57:49Z Sungmin Lee http://arxiv.org/abs/2604.03553v1 Towards the AI Historian: Agentic Information Extraction from Primary Sources 2026-04-04T02:38:23Z

AI is supporting, accelerating, and automating scientific discovery across a diverse set of fields. However, AI adoption in historical research remains limited due to the lack of solutions designed for historians. In this technical progress report, we introduce the first module of Chronos, an AI Historian under development. This module enables historians to convert image scans of primary sources into data through natural-language interactions. Rather than imposing a fixed extraction pipeline powered by a vision-language model (VLM), it allows historians to adapt workflows for heterogeneous source corpora, evaluate the performance of AI models on specific tasks, and iteratively refine workflows through natural-language interaction with the Chronos agent. The module is open-source and ready to be used by historical researchers on their own sources.

2026-04-04T02:38:23Z Lorenz Hufe Niclas Griesshaber Gavin Greif Sebastian Oliver Eck Philip Torr http://arxiv.org/abs/2604.03159v1 BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation 2026-04-03T16:30:58Z

Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers -- popular, low-citation, and recent post-cutoff -- designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields and a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing heavy reliance on parametric memory even when search is available. Field-error co-occurrence analysis identifies two failure modes: wholesale entry substitution (identity fields fail together) and isolated field error. We evaluate clibib, an open-source tool for deterministic BibTeX retrieval from the Zotero Translation Server with CrossRef fallback, as a mitigation mechanism. In a two-stage integration where baseline entries are revised against authoritative records, accuracy rises +8.0pp to 91.5%, fully correct entries rise from 50.9% to 78.3%, and regression rate is only 0.8%. An ablation comparing single-stage and two-stage integration shows that separating search from revision yields larger gains and lower regression (0.8% vs. 4.8%), demonstrating that integration architecture matters independently of model capability. We release the benchmark, error taxonomy, and clibib tool to support evaluation and mitigation of citation hallucinations in LLM-based scientific writing.
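
The gap between field-level accuracy (83.6%) and fully correct entries (50.9%) comes from scoring at two granularities. The sketch below shows the difference on toy data. The field names, the sample entries, and the function names are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, List, Tuple


def score_entries(entries: List[Dict[str, bool]]) -> Tuple[float, float]:
    """Return (field_accuracy, fully_correct_rate).

    Each entry maps a field name to True if that field matched the
    reference record. Field accuracy pools all fields; an entry counts
    as fully correct only when every one of its fields is correct.
    """
    total_fields = sum(len(entry) for entry in entries)
    correct_fields = sum(sum(entry.values()) for entry in entries)
    fully_correct = sum(all(entry.values()) for entry in entries)
    return correct_fields / total_fields, fully_correct / len(entries)


def independent_baseline(field_accuracy: float, n_fields: int) -> float:
    """Fully-correct rate expected IF field errors were independent.

    ASSUMPTION: every entry has n_fields scored fields with equal,
    independent per-field accuracy. This is a reference point only.
    """
    return field_accuracy ** n_fields


# Hypothetical toy entries: one clean, one with a wrong year.
toy = [
    {"title": True, "year": True, "doi": True},
    {"title": True, "year": False, "doi": True},
]
print(score_entries(toy))
print(independent_baseline(0.836, 9))
```

Under the independence assumption, 83.6% accuracy over nine fields would give roughly 20% fully correct entries. The reported 50.9% is much higher, which is what one would expect if errors cluster within a subset of entries, as in the wholesale entry substitution failure mode.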

2026-04-03T16:30:58Z 37 pages Delip Rao Chris Callison-Burch http://arxiv.org/abs/2509.07801v4 SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP 2026-04-03T13:16:07Z

Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. While existing datasets aid model development, most focus on specific publication sections due to domain complexity and the high cost of annotating scientific texts. To address this limitation, we introduce SciNLP - a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 6,409 entities and 1,648 relations. Compared to existing research, SciNLP is the first dataset providing full-text annotations of entities and their relationships in the NLP domain. To validate the effectiveness of SciNLP, we conducted comparative experiments with similar datasets and evaluated the performance of state-of-the-art supervised models on this dataset. Results reveal varying extraction capabilities of existing models across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP achieves significant performance improvements on certain baseline models. Using models trained on SciNLP, we implemented automatic construction of a fine-grained knowledge graph for the NLP domain. Our KG has an average node degree of 3.3 per entity, indicating rich semantic topological information that enhances downstream applications. The dataset is publicly available at: https://github.com/AKADDC/SciNLP.

2025-09-09T14:41:40Z EMNLP 2025 Main Decheng Duan Yingyi Zhang Jitong Peng Chengzhi Zhang http://arxiv.org/abs/2604.06232v1 What Do Humanities Scholars Need? A User Model for Recommendation in Digital Archives 2026-04-02T21:11:15Z

User models for recommender systems (RecSys) typically assume stable preferences, similarity-based relevance, and session-bounded interactions -- assumptions derived from high-volume consumer contexts. This paper investigates these assumptions for humanities scholars working with digital archives. Following a human-centered design approach, we conducted focus groups and analyzed interview data from 18 researchers. Our analysis identifies four dimensions where scholarly information-seeking diverges from common RecSys user modeling: (1) context volatility -- preferences shift with research tasks and domain expertise; (2) epistemic trust -- relevance depends on verifiable provenance; (3) contrastive seeking -- researchers seek items that challenge their current direction; and (4) strand continuity -- research spans long-term threads rather than discrete sessions. We discuss implications for user modeling and outline how these dimensions relate to collaborative filtering, content-based, and session-based recommendation. We propose these dimensions as a diagnostic framework applicable beyond archives to similar application domains where typical user modeling assumptions may not hold.

2026-04-02T21:11:15Z To be presented at the 34th ACM Conference on User Modeling, Adaptation and Personalization (UMAP'26), June 08-11, 2026, Gothenburg, Sweden Florian Atzenhofer-Baumgartner Dominik Kowald 10.1145/3774935.3806171 http://arxiv.org/abs/2603.25638v2 Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers 2026-04-02T17:45:24Z

Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.

2026-03-26T16:49:00Z Visualization of word usage patterns in arXiv abstracts: https://llm-impact.github.io/ Mingmeng Geng Yuhang Dong Thierry Poibeau http://arxiv.org/abs/2604.01793v1 Not Just Large: Tall Teams Dominate East Asia's Scientific Production 2026-04-02T09:05:03Z

Purpose: This study compares the hierarchical structure of scientific teams across countries and investigates factors associated with the observed cross-national differences. Design/methodology/approach: Drawing on 150,817 publications with author contribution statements, we focus on the 15 countries with the largest volume of scientific publications, examine cross-country variations in the proportion of tall teams, and analyze how this proportion correlates with other factors. Findings: Scientific output from East Asia is dominated by tall teams, which persist after controlling for team size, indicating that this pattern cannot be fully accounted for by the prevalence of larger teams in these countries. Cultural factors, measured by Power Distance, as well as the observed funding patterns of major basic science agencies, are associated with the dominance of tall teams in East Asia. Research limitations: This study is limited by its reliance on publications with author contribution statements, which may introduce selection bias; its focus on cultural and funding factors, while leaving other institutional contexts unexamined; and its use of a leadership concentration measure that does not capture other dimensions of hierarchy. Practical implications: Understanding cross-national differences in research team structures and their associated cultural and institutional factors can inform science policy and team management. Originality/value: This study provides a systematic cross-national comparison of team hierarchy and offers a mechanistic understanding of the dominance of tall teams in East Asia, highlighting associations with cultural and funding factors.

2026-04-02T09:05:03Z Siyuan Liu Wenjin Xie Wenyu Chen Tao Jia http://arxiv.org/abs/2604.01729v1 Overton Engage: A Structured Database and Matching System for Academic Policy Engagement Opportunities 2026-04-02T07:49:52Z

Academic policy engagement, the structured processes through which researchers contribute evidence and expertise to public decision-making, is shaped not only by research quality but by the accessibility of engagement opportunities. In practice, these opportunities are fragmented across institutions and platforms, unevenly advertised, and difficult to discover systematically (Parker et al., 2022), limiting both individual participation and comparison. We present Overton Engage (https://app.overton.io/ui/opportunities), a structured database of publicly documented academic policy engagement opportunities, together with a semantic matching system that links opportunities to researchers based on similarity between opportunity descriptions and publication records. We characterise the composition of the database across policy domains, countries, and opportunity types, and present UK-focused analyses comparing engagement opportunity topics with published policy documents. We further demonstrate an illustrative comparison of consultation topics between the UK and Australia, and apply a matching system to assess how closely research produced by UK higher education institutions aligns, topically, with domestic policy opportunities. Our results suggest that publicly documented engagement opportunities are unevenly distributed across policy domains and countries, though this may reflect collection bias. Matching analyses reveal a positive relationship between institutional publication volume and high-confidence match rates, but also that research specialisation can compensate for lower output volume in specific policy domains. The database itself is freely available and we welcome collaboration from researchers, policymakers, and institutions.

2026-04-02T07:49:52Z Ceire Wincott Angel Luis Jaso Tamame Susan Collard Euan Adie Katie Shamash http://arxiv.org/abs/2605.12514v1 Structural Diversity Drives Disruptive Scientific Innovation 2026-04-02T02:15:38Z

Scientific innovation increasingly depends on collaboration, yet the organizational structure that fosters breakthrough ideas remains poorly understood. Existing metrics - such as team size or compositional diversity - capture readily observable characteristics but not the deeper architecture of collaboration. We introduce Structural Diversity (SD): the extent to which a team bridges multiple distinct knowledge communities within its prior collaboration network. Using a century-scale dataset of 260 million scientific publications (1900-2025) and combining causal inference with a quasi-natural experiment based on a U.S. National Science Foundation policy change in 2012, we show that SD is a powerful and robust predictor of disruptive innovation, outperforming traditional team novelty indicators such as team freshness and edge density. Moreover, SD positively interacts with team size and is able to mitigate the well-known "curse of scale" by transforming scale from a liability into a resource for creative synthesis. We find that one mechanism underlying this effect is Disciplinary Integration (DI): teams with higher SD can more effectively combine heterogeneous knowledge into novel configurations. Our findings position SD as both a new theoretical construct and an actionable design principle for organizing scientific collaboration. By linking the architecture of team assembly to the dynamics of creative discovery, our work offers a structural explanation for how collective intelligence can be systematically engineered to foster disruptive innovation.

2026-04-02T02:15:38Z Yichun Peng Saike He Peijie Zhang Kang Zhao Yi Yang Ning Zhang Qingpeng Zhang Daniel Dajun Zeng Hao Peng http://arxiv.org/abs/2604.01186v1 From Validity to Inter-Subjectivity: An Argument for Reliability Signals in Search Environments 2026-04-01T17:34:45Z

Search engines and information platforms are increasingly scrutinized for their role in spreading misinformation. Traditional responses often focus on detecting falsehoods or verifying the ultimate validity of claims. This paper argues that such a validity-centered framing is inadequate for the epistemic challenges of search environments.

2026-04-01T17:34:45Z 4 pages. Extended abstract / conference paper for SEASON 2025 (September 24-25, 2025, Hamburg, Germany). Peer reviewed Frans van der Sluis http://arxiv.org/abs/2604.09669v1 Digital hybridity and relics in cultural heritage: using corpus linguistics to inform design in emerging technologies from AI to VR 2026-04-01T17:16:01Z

Hybrid technologies enable the blending of physical and digital elements, creating new ways to experience and interact with the world. Such technologies can transform engagement with relics, both secular and sacred, but they present challenges for capturing faith, belief, and representation responsibly. Given the complexities of digital representation and the ethical challenges inherent in digitising culturally significant objects, a transdisciplinary understanding of these issues is needed. To inform this discussion from a linguistic perspective, we examined the representation of relics in historical and contemporary texts. Using a corpus linguistic approach to extract modifiers of the word relic in corpora of Early Modern English books and contemporary web-sourced texts from 2021, we examined the multifaceted ways in which relics have been perceived and evaluated over time. Early texts consider relics as both objects of moral and spiritual significance, and tools of religious and political control, while they are more often framed as heritage symbols, reflecting past events, places, and traditions in contemporary texts. We discuss how hybrid, sometimes AI-based technologies can enhance accessibility and engagement, whilst also challenging traditional sensitivities around authenticity and sensory experience, which are integral to the meaning and significance of relics.

2026-04-01T17:16:01Z This is a (ACM J.5 Arts & Humanities Paper) relating to Hybrid Technologies, Language, AI, VR, Interaction and Experience. 24 pages. Int J Digit Humanities (2026) Emma McClaughlin Glenn McGarry Alan Chamberlain Geert De Wilde Oliver Butler 10.1007/s42803-026-00120-4 http://arxiv.org/abs/2604.01073v1 Narrative Fingerprints: Multi-Scale Author Identification via Novelty Curve Dynamics 2026-04-01T16:07:58Z

We test whether authors have characteristic "fingerprints" in the information-theoretic novelty curves of their published works. Working with two corpora -- Books3 (52,796 books, 759 qualifying authors) and PG-19 (28,439 books, 1,821 qualifying authors) -- we find that authorial voice leaves measurable traces in how novelty unfolds across a text. The signal is multi-scale: at book level, scalar dynamics (mean novelty, speed, volume, circuitousness) identify 43% of authors significantly above chance; at chapter level, SAX motif patterns in sliding windows achieve 30x-above-chance attribution, far exceeding the scalar features that dominate at book level. These signals are complementary, not redundant. We show that the fingerprint is partly confounded with genre but persists within-genre for approximately one-quarter of authors. Classical authors (Twain, Austen, Kipling) show fingerprints comparable in strength to modern authors, suggesting the phenomenon is not an artifact of contemporary publishing conventions.

2026-04-01T16:07:58Z 12 pages, 6 figures, 4 tables Fred Zimmerman Hilmar AI