http://arxiv.org/abs/2505.04309v1 Integrating Large Citation Datasets 2025-05-07T10:48:04Z

This paper explores methods for building a comprehensive citation graph using big data techniques to evaluate scientific impact more accurately. Traditional citation metrics have limitations, and this work investigates merging large citation datasets to create a more accurate picture. Challenges of big data, like inconsistent data formats and lack of unique identifiers, are addressed through deduplication efforts, resulting in a streamlined and reliable merged dataset with over 119 million records and 1.4 billion citations. We demonstrate that merging large citation datasets builds a more accurate citation graph facilitating a more robust evaluation of scientific impact.
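
As a rough illustration of the deduplication step described above, the following Python sketch merges records from several sources when they share no unique identifier. The record fields (`title`, `year`, `doi`), the normalized title-plus-year key, and the keep-first merge policy are all assumptions made for illustration; the abstract does not state which matching rule the authors used.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def dedup_key(record: dict) -> tuple:
    """Prefer a DOI when present; otherwise fall back to title plus year.

    Both the field names and this fallback rule are hypothetical.
    """
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("title_year", normalize_title(record["title"]), record.get("year"))


def merge_records(*sources):
    """Keep the first record seen for each key (assumed merge policy)."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(dedup_key(record), record)
    return list(merged.values())
```

A key of this kind will miss a duplicate when only one copy carries a DOI, so a real pipeline would likely need a second matching pass.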

2025-05-07T10:48:04Z 6 pages, 3 figures Inci Yueksel-Erguen Ida Litzel Hanqiu Peng http://arxiv.org/abs/2508.00826v1 Use of LLMs in preparing accessible scientific papers 2025-05-07T00:18:39Z

Making scientific papers accessible may require reprocessing old papers to create output compliant with accessibility standards. An important step is converting the visual formatting to logical formatting. In this report we describe our attempt at zero-shot conversion of arXiv papers. Our results are mixed: conversion is possible, but its reliability is limited. We discuss alternative approaches to this problem.

2025-05-07T00:18:39Z Allison Doami Christine James Dan Lu Lia Prins Annette Torrence Boris Veytsman http://arxiv.org/abs/2504.20065v3 A Computational Analysis and Visualization of In-Text Reference Networks Across Philosophical Texts 2025-05-06T19:56:06Z

We applied computational methods to analyze references across 2,245 philosophical texts, spanning from approximately 550 BCE to 1940 AD, in order to measure patterns in how philosophical ideas have spread over time. Using natural language processing and network analysis, we mapped over 294,970 references between authors, classifying each reference into subdisciplines of philosophy based on its surrounding context. We then constructed a graph, with authors as nodes and textual references as edges, to empirically validate, visualize, and quantify intellectual lineages as they are understood within philosophical scholarship. For instance, we find that Plato and Aristotle alone account for nearly 10% of all references from authors in our dataset, suggesting that their influence may still be underestimated. As another example, we support the view that St. Thomas Aquinas served as a synthesizer between Aristotelian and Christian philosophy by analyzing the network structures of Aquinas, Aristotle, and Christian theologians. Our results are presented through an interactive visualization tool, allowing users to dynamically explore these networks, alongside a mathematical analysis of the network's structure. Our methodology demonstrates the value of applying network analysis with textual references to study a large collection of historical works.
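
The graph construction described above (authors as nodes, textual references as edges) can be sketched in a few lines of Python. The function names are hypothetical and the sketch is not the authors' pipeline. It only shows how a share such as "references pointing to Plato and Aristotle" could be computed from an edge list.

```python
from collections import Counter


def build_reference_graph(references):
    """Return a weighted directed graph as {(citing, cited): count}."""
    return Counter(references)


def share_of_references(edge_counts, targets):
    """Fraction of all references that point at any author in `targets`."""
    total = sum(edge_counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for (_, cited), n in edge_counts.items() if cited in targets)
    return hits / total
```

Called on a list of `(citing_author, cited_author)` pairs, `build_reference_graph` yields edge weights, and `share_of_references(graph, {"Plato", "Aristotle"})` returns the fraction of references directed at those two authors.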

2025-04-22T21:18:40Z 57 pages, 41 figures, 3 tables. To submit to the Oxford Journal of the Digital Humanities Robert Becker Aron Culotta http://arxiv.org/abs/2505.02455v1 Running a Data Integration Lab in the Context of the EHRI Project: Challenges, Lessons Learnt and Future Directions 2025-05-05T08:39:18Z

Historical study of the Holocaust is commonly hampered by the dispersed and fragmented nature of important archival sources relating to this event. The EHRI project set out to mitigate this problem by building a trans-national network of archives, researchers, and digital practitioners, and one of its main outcomes was the creation of the EHRI Portal, a "virtual observatory" that gathers in one centralised platform descriptions of Holocaust-related archival sources from around the world. In order to build the Portal a strong data identification and integration effort was required, culminating in the project's third phase with the creation of the EHRI-3 data integration lab. The focus of the lab was to lower the bar to participation in the EHRI Portal by providing support to institutions in conforming their archival metadata with that required for integration, ultimately opening the process up to smaller institutions (and even so-called "micro-archives") without the necessary resources to undertake this process themselves. In this paper we present our experiences from running the data integration lab and discuss some of the challenges (both of a technical and social nature), how we tried to overcome them, and the overall lessons learnt. We envisage this work as an archetype upon which other practitioners seeking to pursue similar data integration activities can build their own efforts.

2025-05-05T08:39:18Z Submitted to the ACM Journal on Computing and Cultural Heritage Herminio García-González Mike Bryant Suzanne Swartz Fabio Rovigo Veerle Vanden Daelen http://arxiv.org/abs/2410.03342v2 A meta-analysis of impact factors of astrophysics journals 2025-05-04T16:56:57Z

We calculate the 2024 impact factors for the 38 most widely used journals in Astrophysics, using the citations collated by NASA/ADS (Astrophysics Data System) and compare them to the official impact factors. This includes journals which publish papers outside of astrophysics such as PRD, EPJC, Nature, etc. We also propose a new metric to gauge the impact factor based on the median number of citations in a journal and calculate it for all the journals. We find that the ADS-based impact factors mostly agree with the official impact factors, albeit being higher for most journals. The journals with the maximum fractional difference in median-based and old impact factors are JHEAP and PTEP. The difference between the ADS-based and official impact factors is largest for Nature.
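
To make the contrast between the two metrics concrete, the Python sketch below computes a mean-based and a median-based impact value from per-paper citation counts. It assumes the usual two-year window (citations received in the target year by papers published in the two preceding years). The exact definition of the median-based metric in the paper is not given in the abstract, so `median_based_impact` is only one plausible reading.

```python
from statistics import median


def mean_based_impact(citation_counts):
    """Classic impact factor: total citations divided by number of papers."""
    if not citation_counts:
        raise ValueError("no papers in the citation window")
    return sum(citation_counts) / len(citation_counts)


def median_based_impact(citation_counts):
    """Assumed variant: median citations per paper in the same window."""
    if not citation_counts:
        raise ValueError("no papers in the citation window")
    return median(citation_counts)
```

For the invented counts `[0, 1, 2, 3, 44]`, the mean-based value is 10.0 while the median-based value is 2, showing how a single highly cited paper can dominate the mean.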

2024-10-04T11:58:06Z 10 pages, 2 figures. More journals added. Also added miscellaneous publication statistics as well as APC details. Accepted for publication in EPJP Rayani Venkat Sai Rithvik Shantanu Desai 10.1140/epjp/s13360-025-06397-8 http://arxiv.org/abs/2505.01724v1 VisTaxa: Developing a Taxonomy of Historical Visualizations 2025-05-03T07:28:16Z

Historical visualizations are a rich resource for visualization research. While taxonomy is commonly used to structure and understand the design space of visualizations, existing taxonomies primarily focus on contemporary visualizations and largely overlook historical visualizations. To address this gap, we describe an empirical method for taxonomy development. We introduce a coding protocol and the VisTaxa system for taxonomy labeling and comparison. We demonstrate using our method to develop a historical visualization taxonomy by coding 400 images of historical visualizations. We analyze the coding result and reflect on the coding process. Our work is an initial step toward a systematic investigation of the design space of historical visualizations.

2025-05-03T07:28:16Z Accepted to IEEE TVCG (IEEE PacificVis 2025 Journal Track) Yu Zhang Xinyue Chen Weili Zheng Yuhan Guo Guozheng Li Siming Chen Xiaoru Yuan 10.1109/TVCG.2025.3567132 http://arxiv.org/abs/2505.00907v1 Co-Designing a Knowledge Graph Navigation Interface: A Participatory Approach 2025-05-01T23:00:31Z

Navigating and visualizing multilayered knowledge graphs remains a challenging, unresolved problem in information systems design. Building on our earlier study, which engaged end users in both the design and population of a domain-specific knowledge graph, we now focus on translating their insights into actionable interface guidelines. In this paper, we synthesize recommendations drawn from a participatory workshop with doctoral students. We then demonstrate how these recommendations inform the design of a prototype interface. Finally, we found that a participatory iterative design approach can help designers in decision making, leading to interfaces that are both innovative and user-centric. By combining user-driven requirements with proven visualization techniques, this paper presents a coherent framework for guiding future development of knowledge-graph navigation tools.

2025-05-01T23:00:31Z Stanislava Gardasevic Manika Lamba Jasmine S. Malone http://arxiv.org/abs/2504.21589v1 DNB-AI-Project at SemEval-2025 Task 5: An LLM-Ensemble Approach for Automated Subject Indexing 2025-04-30T12:47:09Z

This paper presents our system developed for the SemEval-2025 Task 5: LLMs4Subjects: LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog. Our system relies on prompting a selection of LLMs with varying examples of intellectually annotated records and asking the LLMs to similarly suggest keywords for new records. This few-shot prompting technique is combined with a series of post-processing steps that map the generated keywords to the target vocabulary, aggregate the resulting subject terms to an ensemble vote and, finally, rank them as to their relevance to the record. Our system is fourth in the quantitative ranking in the all-subjects track, but achieves the best result in the qualitative ranking conducted by subject indexing experts.
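
The post-processing chain described above (map keywords to the target vocabulary, aggregate into an ensemble vote, rank) might look roughly like the following Python sketch. The dictionary-based vocabulary lookup and the ranking by vote count are simplifying assumptions. The abstract does not say how the system performs the mapping or the final relevance ranking, and all names here are hypothetical.

```python
from collections import Counter


def map_to_vocabulary(keyword, vocabulary):
    """Map a free-text keyword to a controlled term, or None if unmapped."""
    return vocabulary.get(keyword.strip().lower())


def ensemble_rank(suggestions_per_model, vocabulary, top_k=5):
    """Count one vote per model per subject term and rank by votes."""
    votes = Counter()
    for suggestions in suggestions_per_model:
        # A set ensures each model votes at most once per subject term.
        terms = {map_to_vocabulary(k, vocabulary) for k in suggestions}
        terms.discard(None)
        votes.update(terms)
    # Sort by vote count (descending), then alphabetically to break ties.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

Ranking by raw vote count stands in for the relevance-ranking step here. A fuller version would replace the sort key with a score for how well each term fits the record.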

2025-04-30T12:47:09Z 11 pages, 4 figures, submitted to SemEval-2025 workshop Task 5: LLMs4Subjects In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 1118-1128, Vienna, Austria. Association for Computational Linguistics Lisa Kluge Maximilian Kähler http://arxiv.org/abs/2505.13456v1 PANDAVA: Semantic and Reflexive Protocol for Interdisciplinary and Cognitive Knowledge Synthesis 2025-04-29T21:35:30Z

Modern science faces the need to move from linear systematic review protocols to deeper cognitive navigation across fields of knowledge. In this context, the PANDAVA protocol (Protocol for Analysis and Navigation of Deep Argumentative and Valued Knowledge) is designed for analysing the semantic structures of scientific knowledge. It combines semantic mapping, assessment of concept maturity, clustering, and generation of new hypotheses. PANDAVA is interpreted as the first interdisciplinary protocol for knowledge systematization focused on semantic and cognitive mapping. The PANDAVA protocol integrates quantitative analysis methods with reflective procedures for comprehending the structure of knowledge and is applied in interdisciplinary, theoretically saturated fields where traditional models such as PRISMA prove insufficient. As an example, the protocol was applied to analyse the abiogenesis hypotheses. Modelling demonstrated how to structure theories of the origin of life through the integration of data on microlight, turbulent processes, and geochemical sources. PANDAVA enables researchers to identify strong and weak concepts, construct knowledge maps, and develop new hypotheses. Overall, PANDAVA represents a cognitively enriched tool for meaningful knowledge management, fostering the transition from the representation of facts to the design of new scientific paradigms.

2025-04-29T21:35:30Z Eldar Knar http://arxiv.org/abs/2504.21100v1 A smack of all neighbouring languages: How multilingual is scholarly communication? 2025-04-29T18:18:52Z

Language is a major source of systemic inequities in science, particularly among scholars whose first language is not English. Studies have examined scientists' linguistic practices in specific contexts; few, however, have provided a global analysis of multilingualism in science. Using two major bibliometric databases (OpenAlex and Dimensions), we provide a large-scale analysis of linguistic diversity in science, considering both the language of publications (N=87,577,942) and of cited references (N=1,480,570,087). For the 1990-2023 period, we find that only Indonesian, Portuguese and Spanish have expanded at a faster pace than English. Country-level analyses show that this trend is due to the growing strength of the Latin American and Indonesian academic circuits. Our results also confirm the own-language preference phenomenon (particularly for languages other than English), the strong connection between multilingualism and bibliodiversity, and that social sciences and humanities are the least English-dominated fields. Our findings suggest that policies recognizing the value of both national-language and English-language publications have had a concrete impact on the distribution of languages in the global field of scholarly communication.

2025-04-29T18:18:52Z Carolina Pradier Lucía Céspedes Vincent Larivière http://arxiv.org/abs/2505.11503v1 The Impact and Influence of Academic Genealogies 2025-04-29T15:02:10Z

We introduce the concept of an academic genealogy (AG), illustrate how AG charts may be constructed, and demonstrate the methodology by creating partial or full AG charts for two scientists, Paul A. Samuelson and Ronald E. Mickens.

2025-04-29T15:02:10Z Bryan Briones Ronald E. Mickens Charmayne Patterson http://arxiv.org/abs/2504.20323v1 Labeling Case Similarity based on Co-Citation of Legal Articles in Judgment Documents with Empirical Dispute-Based Evaluation 2025-04-29T00:26:37Z

This report addresses the challenge of limited labeled datasets for developing legal recommender systems, particularly in specialized domains like labor disputes. We propose a new approach leveraging the co-citation of legal articles within cases to establish similarity and enable algorithmic annotation. This method draws a parallel to the concept of case co-citation, utilizing cited precedents as indicators of shared legal issues. To evaluate the labeled results, we employ a system that recommends similar cases based on plaintiffs' accusations, defendants' rebuttals, and points of dispute. The evaluation demonstrates that the recommender, with fine-tuned text embedding models and a reasonable BiLSTM module, can recommend labor cases whose similarity was measured by the co-citation of the legal articles. This research contributes to the development of automated annotation techniques for legal documents, particularly in areas with limited access to comprehensive legal databases.
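
One simple way to turn co-cited legal articles into similarity labels is set overlap. The Python sketch below uses Jaccard similarity with a fixed threshold. Both choices are assumptions for illustration, since the abstract does not name the measure or cutoff the authors applied.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of cited legal articles."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def label_similar_pairs(cited_articles, threshold=0.5):
    """Label each pair of cases as similar (True) or not (False).

    `cited_articles` maps a case id to the legal articles that case cites.
    The threshold value is a hypothetical default.
    """
    labels = {}
    for id_a, id_b in combinations(sorted(cited_articles), 2):
        score = jaccard(cited_articles[id_a], cited_articles[id_b])
        labels[(id_a, id_b)] = score >= threshold
    return labels
```

Labels produced this way need no manual annotation, which is the property the report relies on when labeled data are scarce.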

2025-04-29T00:26:37Z 16 pages, 9 figures, 2 tables, the Nineteenth International Workshop on Juris-Informatics (JURISIN 2025), associated with the Seventeenth JSAI International Symposium on AI (JSAI-isAI 2025) Lecture Notes in Artificial Intelligence (volume number to be added), 2025 Chao-Lin Liu Po-Hsien Wu Yi-Ting Yu http://arxiv.org/abs/2504.20125v1 Towards Large Language Models for Lunar Mission Planning and In Situ Resource Utilization 2025-04-28T13:33:37Z

A key factor for lunar mission planning is the ability to assess the local availability of raw materials. However, many potentially relevant measurements are scattered across a variety of scientific publications. In this paper we consider the viability of obtaining lunar composition data by leveraging LLMs to rapidly process a corpus of scientific publications. While leveraging LLMs to obtain knowledge from scientific documents is not new, this particular application presents interesting challenges due to the heterogeneity of lunar samples and the nuances involved in their characterization. Accuracy and uncertainty quantification are particularly crucial since many materials properties can be sensitive to small variations in composition. Our findings indicate that off-the-shelf LLMs are generally effective at extracting data from tables commonly found in these documents. However, there remains opportunity to further refine the data we extract in this initial approach; in particular, to capture fine-grained mineralogy information and to improve performance on more subtle/complex pieces of information.

2025-04-28T13:33:37Z Michael Pekala Gregory Canal Samuel Barham Milena B. Graziano Morgan Trexler Leslie Hamilton Elizabeth Reilly Christopher D. Stiles http://arxiv.org/abs/2407.10652v2 Cutting Through the Clutter: The Potential of LLMs for Efficient Filtration in Systematic Literature Reviews 2025-04-28T07:58:34Z

Systematic literature reviews (SLRs) are essential but labor-intensive due to high publication volumes and inefficient keyword-based filtering. To streamline this process, we evaluate Large Language Models (LLMs) for enhancing efficiency and accuracy in corpus filtration while minimizing manual effort. Our open-source tool LLMSurver presents a visual interface to utilize LLMs for literature filtration, evaluate the results, and refine queries in an interactive way. We assess the real-world performance of our approach in filtering over 8.3k articles during a recent survey construction, comparing results with human efforts. The findings show that recent LLMs can reduce filtering time from weeks to minutes. A consensus scheme ensures recall rates >98.8%, surpassing typical human error thresholds and improving selection accuracy. This work advances literature review methodologies and highlights the potential of responsible human-AI collaboration in academic research.
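
The abstract does not describe how the consensus scheme works. As one hedged possibility, the Python sketch below keeps an article whenever at least `min_include_votes` models vote to include it, which favors recall over precision. The function names, the voting rule, and the default of one vote are all assumptions, not the behavior of LLMSurver.

```python
def consensus_filter(votes_per_article, min_include_votes=1):
    """Keep articles that receive enough 'include' votes across models.

    `votes_per_article` maps an article id to a list of booleans,
    one per model (True means the model voted to include).
    """
    kept = []
    for article_id, votes in votes_per_article.items():
        if sum(votes) >= min_include_votes:
            kept.append(article_id)
    return kept


def recall(kept, relevant):
    """Share of truly relevant articles that survived the filter."""
    relevant = set(relevant)
    if not relevant:
        return 1.0
    return len(relevant & set(kept)) / len(relevant)
```

With a threshold of one vote, an article is discarded only when every model rejects it, so a relevant article is lost only if all models miss it at once.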

2024-07-15T12:13:53Z 6 pages, 5 figures, 1 table Proceedings of the 16th International EuroVis Workshop on Visual Analytics (EuroVA), 2025 Lucas Joos Daniel A. Keim Maximilian T. Fischer 10.2312/eurova.20251105 http://arxiv.org/abs/2504.18896v1 Effect of perceived preprint effectiveness and research intensity on posting behaviour 2025-04-26T11:34:57Z

Open science is increasingly recognised worldwide, with preprint posting emerging as a key strategy. This study explores the factors influencing researchers' adoption of preprint publication, particularly the perceived effectiveness of this practice and research intensity indicators such as publication and review frequency. Using open data from a comprehensive survey with 5,873 valid responses, we conducted regression analyses to control for demographic variables. Researchers' productivity, particularly the number of journal articles and books published, greatly influences the frequency of preprint deposits. The perception of the effectiveness of preprints follows this. Preprints are viewed positively in terms of early access to new research, but negatively in terms of early feedback. Demographic variables, such as gender and the type of organisation conducting the research, do not have a significant impact on the production of preprints when other factors are controlled for. However, the researcher's discipline, years of experience and geographical region generally have a moderate effect on the production of preprints. These findings highlight the motivations and barriers associated with preprint publication and provide insights into how researchers perceive the benefits and challenges of this practice within the broader context of open science.

2025-04-26T11:34:57Z 24 pages, 5 tables Pablo Dorta-González María Isabel Dorta-González