http://arxiv.org/abs/2505.04309v1
Integrating Large Citation Datasets
2025-05-07T10:48:04Z
This paper explores methods for building a comprehensive citation graph using big data techniques to evaluate scientific impact more accurately. Traditional citation metrics have limitations, and this work investigates merging large citation datasets to create a more accurate picture. Challenges of big data, like inconsistent data formats and lack of unique identifiers, are addressed through deduplication efforts, resulting in a streamlined and reliable merged dataset with over 119 million records and 1.4 billion citations. We demonstrate that merging large citation datasets builds a more accurate citation graph facilitating a more robust evaluation of scientific impact.
2025-05-07T10:48:04Z
6 pages, 3 figures
Inci Yueksel-Erguen, Ida Litzel, Hanqiu Peng

http://arxiv.org/abs/2508.00826v1
Use of LLMs in preparing accessible scientific papers
2025-05-07T00:18:39Z
Making scientific papers accessible may require reprocessing old papers to create output compliant with accessibility standards. An important step there is to convert the visual formatting to the logical one. In this report we describe our attempt at zero shot conversion of arXiv papers. Our results are mixed: while it is possible to do conversion, the reliability is not too good. We discuss alternative approaches to this problem.
2025-05-07T00:18:39Z
Allison Doami, Christine James, Dan Lu, Lia Prins, Annette Torrence, Boris Veytsman

http://arxiv.org/abs/2504.20065v3
A Computational Analysis and Visualization of In-Text Reference Networks Across Philosophical Texts
2025-05-06T19:56:06Z
We applied computational methods to analyze references across 2,245 philosophical texts, spanning from approximately 550 BCE to 1940 AD, in order to measure patterns in how philosophical ideas have spread over time.
Using natural language processing and network analysis, we mapped over 294,970 references between authors, classifying each reference into subdisciplines of philosophy based on its surrounding context. We then constructed a graph, with authors as nodes and textual references as edges, to empirically validate, visualize, and quantify intellectual lineages as they are understood within philosophical scholarship. For instance, we find that Plato and Aristotle alone account for nearly 10% of all references from authors in our dataset, suggesting that their influence may still be underestimated. As another example, we support the view that St. Thomas Aquinas served as a synthesizer between Aristotelian and Christian philosophy by analyzing the network structures of Aquinas, Aristotle, and Christian theologians. Our results are presented through an interactive visualization tool, allowing users to dynamically explore these networks, alongside a mathematical analysis of the network's structure. Our methodology demonstrates the value of applying network analysis with textual references to study a large collection of historical works.
2025-04-22T21:18:40Z
57 pages, 41 figures, 3 tables. To submit to the Oxford Journal of the Digital Humanities
Robert Becker, Aron Culotta

http://arxiv.org/abs/2505.02455v1
Running a Data Integration Lab in the Context of the EHRI Project: Challenges, Lessons Learnt and Future Directions
2025-05-05T08:39:18Z
Historical study of the Holocaust is commonly hampered by the dispersed and fragmented nature of important archival sources relating to this event. The EHRI project set out to mitigate this problem by building a trans-national network of archives, researchers, and digital practitioners, and one of its main outcomes was the creation of the EHRI Portal, a "virtual observatory" that gathers in one centralised platform descriptions of Holocaust-related archival sources from around the world.
In order to build the Portal a strong data identification and integration effort was required, culminating in the project's third phase with the creation of the EHRI-3 data integration lab. The focus of the lab was to lower the bar to participation in the EHRI Portal by providing support to institutions in conforming their archival metadata with that required for integration, ultimately opening the process up to smaller institutions (and even so-called "micro-archives") without the necessary resources to undertake this process themselves. In this paper we present our experiences from running the data integration lab and discuss some of the challenges (both of a technical and social nature), how we tried to overcome them, and the overall lessons learnt. We envisage this work as an archetype upon which other practitioners seeking to pursue similar data integration activities can build their own efforts.
2025-05-05T08:39:18Z
Submitted to the ACM Journal on Computing and Cultural Heritage
Herminio García-González, Mike Bryant, Suzanne Swartz, Fabio Rovigo, Veerle Vanden Daelen

http://arxiv.org/abs/2410.03342v2
A meta-analysis of impact factors of astrophysics journals
2025-05-04T16:56:57Z
We calculate the 2024 impact factors for the 38 most widely used journals in Astrophysics, using the citations collated by NASA/ADS (Astrophysics Data System) and compare them to the official impact factors. This includes journals which publish papers outside of astrophysics such as PRD, EPJC, Nature, etc. We also propose a new metric to gauge the impact factor based on the median number of citations in a journal and calculate the same for all the journals. We find that the ADS-based impact factors are mostly in agreement, albeit higher than the official impact factors for most journals. The journals with the maximum fractional difference in median-based and old impact factors are JHEAP and PTEP.
We find the maximum difference between the ADS and official impact factor for Nature.
2024-10-04T11:58:06Z
10 pages, 2 figures. More journals added. Also added miscellaneous publication statistics as well as APC details. Accepted for publication in EPJP
Rayani Venkat Sai Rithvik, Shantanu Desai
10.1140/epjp/s13360-025-06397-8

http://arxiv.org/abs/2505.01724v1
VisTaxa: Developing a Taxonomy of Historical Visualizations
2025-05-03T07:28:16Z
Historical visualizations are a rich resource for visualization research. While taxonomy is commonly used to structure and understand the design space of visualizations, existing taxonomies primarily focus on contemporary visualizations and largely overlook historical visualizations. To address this gap, we describe an empirical method for taxonomy development. We introduce a coding protocol and the VisTaxa system for taxonomy labeling and comparison. We demonstrate using our method to develop a historical visualization taxonomy by coding 400 images of historical visualizations. We analyze the coding result and reflect on the coding process. Our work is an initial step toward a systematic investigation of the design space of historical visualizations.
2025-05-03T07:28:16Z
Accepted to IEEE TVCG (IEEE PacificVis 2025 Journal Track)
Yu Zhang, Xinyue Chen, Weili Zheng, Yuhan Guo, Guozheng Li, Siming Chen, Xiaoru Yuan
10.1109/TVCG.2025.3567132

http://arxiv.org/abs/2505.00907v1
Co-Designing a Knowledge Graph Navigation Interface: A Participatory Approach
2025-05-01T23:00:31Z
Navigating and visualizing multilayered knowledge graphs remains a challenging, unresolved problem in information systems design. Building on our earlier study, which engaged end users in both the design and population of a domain-specific knowledge graph, we now focus on translating their insights into actionable interface guidelines. In this paper, we synthesize recommendations drawn from a participatory workshop with doctoral students.
We then demonstrate how these recommendations inform the design of a prototype interface. Finally, we found that a participatory iterative design approach can help designers in decision making, leading to interfaces that are both innovative and user-centric. By combining user-driven requirements with proven visualization techniques, this paper presents a coherent framework for guiding future development of knowledge-graph navigation tools.
2025-05-01T23:00:31Z
Stanislava Gardasevic, Manika Lamba, Jasmine S. Malone

http://arxiv.org/abs/2504.21589v1
DNB-AI-Project at SemEval-2025 Task 5: An LLM-Ensemble Approach for Automated Subject Indexing
2025-04-30T12:47:09Z
This paper presents our system developed for the SemEval-2025 Task 5: LLMs4Subjects: LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog. Our system relies on prompting a selection of LLMs with varying examples of intellectually annotated records and asking the LLMs to similarly suggest keywords for new records. This few-shot prompting technique is combined with a series of post-processing steps that map the generated keywords to the target vocabulary, aggregate the resulting subject terms to an ensemble vote and, finally, rank them as to their relevance to the record. Our system is fourth in the quantitative ranking in the all-subjects track, but achieves the best result in the qualitative ranking conducted by subject indexing experts.
2025-04-30T12:47:09Z
11 pages, 4 figures, submitted to SemEval-2025 workshop Task 5: LLMs4Subjects
In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 1118-1128, Vienna, Austria. Association for Computational Linguistics
Lisa Kluge, Maximilian Kähler

http://arxiv.org/abs/2505.13456v1
PANDAVA: Semantic and Reflexive Protocol for Interdisciplinary and Cognitive Knowledge Synthesis
2025-04-29T21:35:30Z
Modern science faces the need to move from linear systematic review protocols to deeper cognitive navigation across fields of knowledge. In this context, the PANDAVA protocol (Protocol for Analysis and Navigation of Deep Argumentative and Valued Knowledge) is designed for analysing the semantic structures of scientific knowledge. It combines semantic mapping, assessment of concept maturity, clustering, and generation of new hypotheses. PANDAVA is interpreted as the first interdisciplinary protocol for knowledge systematization focused on semantic and cognitive mapping. The PANDAVA protocol integrates quantitative analysis methods with reflective procedures for comprehending the structure of knowledge and is applied in interdisciplinary, theoretically saturated fields where traditional models such as PRISMA prove insufficient. As an example, the protocol was applied to analyse the abiogenesis hypotheses. Modelling demonstrated how to structure theories of the origin of life through the integration of data on microlight, turbulent processes, and geochemical sources. PANDAVA enables researchers to identify strong and weak concepts, construct knowledge maps, and develop new hypotheses. Overall, PANDAVA represents a cognitively enriched tool for meaningful knowledge management, fostering the transition from the representation of facts to the design of new scientific paradigms.
2025-04-29T21:35:30Z
Eldar Knar

http://arxiv.org/abs/2504.21100v1
A smack of all neighbouring languages: How multilingual is scholarly communication?
2025-04-29T18:18:52Z
Language is a major source of systemic inequities in science, particularly among scholars whose first language is not English.
Studies have examined scientists' linguistic practices in specific contexts; few, however, have provided a global analysis of multilingualism in science. Using two major bibliometric databases (OpenAlex and Dimensions), we provide a large-scale analysis of linguistic diversity in science, considering both the language of publications (N=87,577,942) and of cited references (N=1,480,570,087). For the 1990-2023 period, we find that only Indonesian, Portuguese and Spanish have expanded at a faster pace than English. Country-level analyses show that this trend is due to the growing strength of the Latin American and Indonesian academic circuits. Our results also confirm the own-language preference phenomenon (particularly for languages other than English), the strong connection between multilingualism and bibliodiversity, and that social sciences and humanities are the least English-dominated fields. Our findings suggest that policies recognizing the value of both national-language and English-language publications have had a concrete impact on the distribution of languages in the global field of scholarly communication.
2025-04-29T18:18:52Z
Carolina Pradier, Lucía Céspedes, Vincent Larivière

http://arxiv.org/abs/2505.11503v1
The Impact and Influence of Academic Genealogies
2025-04-29T15:02:10Z
We introduce the concept of an academic genealogy (AG), illustrate how AG charts may be constructed, and demonstrate the methodology by creating partial or full AG charts for two scientists, Paul A. Samuelson and Ronald E. Mickens.
2025-04-29T15:02:10Z
Bryan Briones, Ronald E. Mickens, Charmayne Patterson

http://arxiv.org/abs/2504.20323v1
Labeling Case Similarity based on Co-Citation of Legal Articles in Judgment Documents with Empirical Dispute-Based Evaluation
2025-04-29T00:26:37Z
This report addresses the challenge of limited labeled datasets for developing legal recommender systems, particularly in specialized domains like labor disputes.
We propose a new approach leveraging the co-citation of legal articles within cases to establish similarity and enable algorithmic annotation. This method draws a parallel to the concept of case co-citation, utilizing cited precedents as indicators of shared legal issues. To evaluate the labeled results, we employ a system that recommends similar cases based on plaintiffs' accusations, defendants' rebuttals, and points of dispute. The evaluation demonstrates that the recommender, with finetuned text embedding models and a reasonable BiLSTM module, can recommend labor cases whose similarity was measured by the co-citation of the legal articles. This research contributes to the development of automated annotation techniques for legal documents, particularly in areas with limited access to comprehensive legal databases.
2025-04-29T00:26:37Z
16 pages, 9 figures, 2 tables, the Nineteenth International Workshop on Juris-Informatics (JURISIN 2025), associated with the Seventeenth JSAI International Symposium on AI (JSAI-isAI 2025)
Lecture Notes in Artificial Intelligence (volume number to be added), 2025
Chao-Lin Liu, Po-Hsien Wu, Yi-Ting Yu

http://arxiv.org/abs/2504.20125v1
Towards Large Language Models for Lunar Mission Planning and In Situ Resource Utilization
2025-04-28T13:33:37Z
A key factor for lunar mission planning is the ability to assess the local availability of raw materials. However, many potentially relevant measurements are scattered across a variety of scientific publications. In this paper we consider the viability of obtaining lunar composition data by leveraging LLMs to rapidly process a corpus of scientific publications. While leveraging LLMs to obtain knowledge from scientific documents is not new, this particular application presents interesting challenges due to the heterogeneity of lunar samples and the nuances involved in their characterization.
Accuracy and uncertainty quantification are particularly crucial since many materials properties can be sensitive to small variations in composition. Our findings indicate that off-the-shelf LLMs are generally effective at extracting data from tables commonly found in these documents. However, there remains opportunity to further refine the data we extract in this initial approach; in particular, to capture fine-grained mineralogy information and to improve performance on more subtle/complex pieces of information.
2025-04-28T13:33:37Z
Michael Pekala, Gregory Canal, Samuel Barham, Milena B. Graziano, Morgan Trexler, Leslie Hamilton, Elizabeth Reilly, Christopher D. Stiles

http://arxiv.org/abs/2407.10652v2
Cutting Through the Clutter: The Potential of LLMs for Efficient Filtration in Systematic Literature Reviews
2025-04-28T07:58:34Z
Systematic literature reviews (SLRs) are essential but labor-intensive due to high publication volumes and inefficient keyword-based filtering. To streamline this process, we evaluate Large Language Models (LLMs) for enhancing efficiency and accuracy in corpus filtration while minimizing manual effort. Our open-source tool LLMSurver presents a visual interface to utilize LLMs for literature filtration, evaluate the results, and refine queries in an interactive way. We assess the real-world performance of our approach in filtering over 8.3k articles during a recent survey construction, comparing results with human efforts. The findings show that recent LLM models can reduce filtering time from weeks to minutes. A consensus scheme ensures recall rates >98.8%, surpassing typical human error thresholds and improving selection accuracy. This work advances literature review methodologies and highlights the potential of responsible human-AI collaboration in academic research.
2024-07-15T12:13:53Z
6 pages, 5 figures, 1 table
Proceedings of the 16th International EuroVis Workshop on Visual Analytics (EuroVA), 2025
Lucas Joos, Daniel A. Keim, Maximilian T. Fischer
10.2312/eurova.20251105

http://arxiv.org/abs/2504.18896v1
Effect of perceived preprint effectiveness and research intensity on posting behaviour
2025-04-26T11:34:57Z
Open science is increasingly recognised worldwide, with preprint posting emerging as a key strategy. This study explores the factors influencing researchers' adoption of preprint publication, particularly the perceived effectiveness of this practice and research intensity indicators such as publication and review frequency. Using open data from a comprehensive survey with 5,873 valid responses, we conducted regression analyses to control for demographic variables. Researchers' productivity, particularly the number of journal articles and books published, greatly influences the frequency of preprint deposits. The perception of the effectiveness of preprints follows this. Preprints are viewed positively in terms of early access to new research, but negatively in terms of early feedback. Demographic variables, such as gender and the type of organisation conducting the research, do not have a significant impact on the production of preprints when other factors are controlled for. However, the researcher's discipline, years of experience and geographical region generally have a moderate effect on the production of preprints. These findings highlight the motivations and barriers associated with preprint publication and provide insights into how researchers perceive the benefits and challenges of this practice within the broader context of open science.
2025-04-26T11:34:57Z
24 pages, 5 tables
Pablo Dorta-González, María Isabel Dorta-González