https://arxiv.org/api/ZCcFpMBklwOSQlJMAtK2n57SBh0 2026-06-13T21:59:46Z 6065 495 15

http://arxiv.org/abs/2511.10354v1 Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates 2025-11-13T14:29:51Z

Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts, etc.), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.

2025-11-13T14:29:51Z 46 pages Andrea Schimmenti Valentina Pasqual Fabio Vitali Marieke van Erp

http://arxiv.org/abs/2511.13747v1 US Code growth 1991-2025 2025-11-13T13:59:38Z

This is the first scientific article since 2010 to count the words of effective and permanent federal law in the United States (US) Code. The latest version of the US Code, published in 2025, is the largest since 1991, encompassing over 24.4 million words. The low since 1991 was 1993, at roughly 15 million words. The word count grew in 30 out of 33 years. The average number of characters per word, potentially indicative of complexity, grew in all 33 years. The research also touches upon why it is undesirable for laws to grow ever longer and mentions possible remedies.
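The two measures named in this abstract, total word count and average characters per word, can be illustrated with a minimal sketch. The tokenization rule below (runs of letters and digits, optionally joined by an apostrophe or hyphen) and the function name `word_stats` are assumptions made for illustration; the abstract does not state how the authors split the US Code into words.

```python
import re

# Assumed tokenization: a word is a run of letters/digits, optionally joined
# by internal apostrophes or hyphens. This is NOT taken from the paper.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def word_stats(text: str) -> tuple[int, float]:
    """Return (word count, mean characters per word) for a text."""
    words = WORD_PATTERN.findall(text)
    count = len(words)
    if count == 0:
        return 0, 0.0
    total_chars = sum(len(word) for word in words)
    return count, total_chars / count


if __name__ == "__main__":
    # Hypothetical sample sentence, not a quotation from the US Code.
    count, mean_len = word_stats("No person shall be held.")
    print(count, mean_len)
```

Comparing these two numbers across yearly editions would give the kind of growth series the abstract describes, although any real replication would need the authors' own definition of a word.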

2025-11-13T13:59:38Z CM presented this research project at the 2025 Benedict College International Multidisciplinary Conference on 2025-03-12 Christopher Mantzaris Ajda Fošner

http://arxiv.org/abs/2507.01651v2 A Dynamical Cartography of the Epistemic Diffusion of Artificial Intelligence in Neuroscience 2025-11-12T17:14:56Z

Neuroscience and AI have an intertwined history, largely relayed in the literature of both fields. In recent years, due to the engineering orientation of AI research and industry's monopoly over its large-scale applications, the mutual expansion of neuroscience and AI in fundamental research seems challenged. In this paper, we provide empirical evidence that, on the contrary, AI and neuroscience continue to grow together, but with a pronounced interest in fields of study related to neurodegenerative diseases since the 1990s. With a temporal knowledge cartography of neuroscience drawn with advanced document embedding techniques, we trace the dynamical shaping of the discipline since the 1970s and identify the conceptual articulation of AI with this particular subfield. However, further analysis of the underlying citation network of the studied corpus shows that the AI technologies produced remain confined to their respective subfields and are not transferred from one subfield to another. This invites us to discuss the genericity of AI in the context of intradisciplinary development, especially in the diffusion of its associated metrology.

2025-07-02T12:24:44Z Sylvain Fontaine

http://arxiv.org/abs/2502.18807v7 BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction 2025-11-12T13:01:43Z

Battery Life Prediction (BLP), which relies on time series data produced by battery degradation tests, is crucial for battery utilization, optimization, and production. Despite impressive advancements, this research area faces three key challenges. Firstly, the limited size of existing datasets impedes insights into modern battery life data. Secondly, most datasets are restricted to small-capacity lithium-ion batteries tested in labs under a narrow range of conditions, raising concerns about the generalizability of findings. Thirdly, inconsistent and limited benchmarks across studies obscure the effectiveness of baselines and leave it unclear whether models popular in other time series fields are effective for BLP. To address these challenges, we propose BatteryLife, a comprehensive dataset and benchmark for BLP. BatteryLife integrates 16 datasets, offering 2.5 times the sample size of the previous largest dataset, and provides the most diverse battery life resource, with batteries from 8 formats, 59 chemical systems, 9 operating temperatures, and 421 charge/discharge protocols, including both laboratory and industrial tests. Notably, BatteryLife is the first to release battery life datasets of zinc-ion batteries, sodium-ion batteries, and industry-tested large-capacity lithium-ion batteries. With the comprehensive dataset, we revisit the effectiveness of baselines popular in this and other time series fields. Furthermore, we propose CyclePatch, a plug-in technique that can be employed in various neural networks. Extensive benchmarking of 18 methods reveals that models popular in other time series fields can be unsuitable for BLP, and that CyclePatch consistently improves model performance, establishing state-of-the-art benchmarks. Moreover, BatteryLife evaluates model performance across aging conditions and domains. BatteryLife is available at https://github.com/Ruifeng-Tan/BatteryLife.

2025-02-26T04:21:20Z Accepted by KDD 2025. Typos and data statistics mistakes are fixed Ruifeng Tan Weixiang Hong Jiayue Tang Xibin Lu Ruijun Ma Xiang Zheng Jia Li Jiaqiang Huang Tong-Yi Zhang

http://arxiv.org/abs/2511.09248v1 SciCom Wiki: A Digital Library to Support the Science Communication Knowledge Infrastructure for Videos and Podcasts 2025-11-12T12:10:46Z

Videos and Podcasts have established themselves as the medium of choice for civic dissemination, but also as carriers of misinformation. The emerging Science Communication Knowledge Infrastructure (SciCom KI), which curates these increasingly non-textual media, remains fragmented and inadequately equipped to scale against the content flood. Our work sets out to support the SciCom KI with a central, collaborative platform, the SciCom Wiki, to facilitate FAIR (findable, accessible, interoperable, reusable) media representation, particularly for videos and podcasts. We survey requirements from 53 stakeholders and individually refine these insights in 11 interviews. We then design and implement an open-source service system centered on Wikibase and evaluate our prototype with another 14 participants. Overall, our findings identified several needs to support the SciCom KI systematically. Our SciCom Wiki approach was found suitable to address the raised requirements. Further, we identified that the SciCom KI is severely underdeveloped regarding FAIR knowledge and related systems facilitating its collaborative creation and curation. Our system can provide a central knowledge node similar to Wikidata, yet a collaborative effort is required to scale the necessary features against the imminent (mis-)information flood.

2025-11-12T12:10:46Z 10 pages, 10 figures, accepted at JCDL 2025 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Tim Wittenborg Niklas Stehr Oliver Karras Sören Auer 10.1109/JCDL67857.2025.00011

http://arxiv.org/abs/2511.07790v1 CC30k: A Citation Contexts Dataset for Reproducibility-Oriented Sentiment Analysis 2025-11-11T03:13:17Z

Sentiments about the reproducibility of cited papers in downstream literature offer community perspectives and have been shown to be a promising signal of the actual reproducibility of published findings. To train models that effectively predict reproducibility-oriented sentiments and to systematically study their correlation with reproducibility, we introduce the CC30k dataset, comprising a total of 30,734 citation contexts in machine learning papers. Each citation context is labeled with one of three reproducibility-oriented sentiment labels: Positive, Negative, or Neutral, reflecting the cited paper's perceived reproducibility or replicability. Of these, 25,829 are labeled through crowdsourcing, supplemented with negatives generated through a controlled pipeline to counter the scarcity of negative labels. Unlike traditional sentiment analysis datasets, CC30k focuses on reproducibility-oriented sentiments, addressing a research gap in resources for computational reproducibility studies. The dataset was created through a pipeline that includes robust data cleansing, careful crowd selection, and thorough validation. The resulting dataset achieves a labeling accuracy of 94%. We then demonstrate that the performance of three large language models improves significantly on reproducibility-oriented sentiment classification after fine-tuning on our dataset. The dataset lays the foundation for large-scale assessments of the reproducibility of machine learning papers. The CC30k dataset and the Jupyter notebooks used to produce and analyze the dataset are publicly available at https://github.com/lamps-lab/CC30k.

2025-11-11T03:13:17Z Peer reviewed and accepted at JCDL 2025, 16 pages, 7 figures Rochana R. Obadage Sarah M. Rajtmajer Jian Wu

http://arxiv.org/abs/2511.07168v1 LEAD: LLM-enhanced Engine for Author Disambiguation 2025-11-10T15:00:39Z

Author Name Disambiguation (AND) is a long-standing challenge in bibliometrics and scientometrics, as name ambiguity undermines the accuracy of bibliographic databases and the reliability of research evaluation. This study addresses the problem of cross-source disambiguation by linking academic career records from CercaUniversità, the official registry of Italian academics, with author profiles in Scopus. We introduce LEAD (LLM-enhanced Engine for Author Disambiguation), a novel hybrid framework that combines semantic features extracted through Large Language Models (LLMs) with structural evidence derived from co-authorship and citation networks. Using a gold standard of 606 ambiguous cases, we compare five methods: (i) Label Spreading on co-authorship networks; (ii) Bibliographic Coupling on citation networks; (iii) a standalone LLM-based approach; (iv) an LLM-enriched configuration; and (v) the proposed hybrid pipeline. LEAD achieves the best performance (F1 = 96.7%, accuracy = 95.7%) with lower computational cost than full LLM models. Bibliographic Coupling emerges as the fastest and strongest single-source method. These findings demonstrate that integrating semantic and structural signals within a selective hybrid strategy offers a robust and scalable solution to cross-database author identification. Beyond the Italian case, this work highlights the potential of hybrid LLM-based methods to improve data quality and reliability in scientometric analyses.
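For readers less familiar with the two metrics reported above, the sketch below shows how F1 and accuracy are conventionally computed from confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the LEAD evaluation; the abstract does not say how matches and non-matches were tallied.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all decisions (matches and non-matches) that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(f1_score(tp=8, fp=2, fn=2))
    print(accuracy(tp=8, tn=8, fp=2, fn=2))
```

Because F1 ignores true negatives while accuracy counts them, the two figures can diverge on imbalanced test sets, which is one reason studies often report both.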

2025-11-10T15:00:39Z Giusy Giulia Tuccari Lorenzo Giammei Andrea Giovanni Nuzzolese Misael Mongiovì Antonio Zinilli Francesco Poggi

http://arxiv.org/abs/2511.07491v1 Quantifying the Impact of CU: A Systematic Literature Review 2025-11-10T12:46:43Z

Community Unionism has served as a pivotal concept in debates on trade union renewal since the early 2000s, yet its theoretical coherence and political significance remain unresolved. This article investigates why CU has gained such prominence -- not by testing its efficacy, but by mapping how it is constructed, cited, and contested across the scholarly literature. Using two complementary systematic approaches -- a citation network analysis of 114 documents and a thematic review of 18 core CU case studies -- I examine how CU functions as both an empirical descriptor and a normative ideal. The analysis reveals CU's dual genealogy: positioned by British scholars as an indigenous return to historic rank-and-file practices, yet structurally aligned with transnational social movement unionism. Thematic coding shows near-universal emphasis on coalition-building and alliances, but deep ambivalence toward class politics. This tension suggests CU's significance lies less in operationalising a new union model, and more in managing contradictions -- between workplace and community, leadership and rank-and-file, reform and radicalism -- within a shrinking labour movement.

2025-11-10T12:46:43Z 14 pages, 2 figures, 2 tables Thomas Compton

http://arxiv.org/abs/2511.07490v1 Have we reached the beginning of the end for review papers? 2025-11-10T12:17:51Z

Review papers have traditionally enjoyed a high status in academic publishing because of the important role they can play in summarising and synthesising a field of research. They can also attract significantly more citations than primary research papers presenting original research, making them attractive to authors. There has been a dramatic increase in the publication of review papers in recent years, both in raw numbers and as a proportion of overall publication output. In this paper we demonstrate this increase across a wide range of fields of study. We quantify the citation dividend associated with review papers, but also demonstrate that it is declining and discuss the reasons for this decline. We further show that, since the arrival of GenAI tools in 2022, there is evidence of widespread use of GenAI in research paper writing, and we present evidence for a stronger AI signal among review papers compared to primary research papers. We suggest that the potential for GenAI to accelerate and even automate the production of review papers will have a further significant impact on their status.

2025-11-10T12:17:51Z 5 figures, 15 pages excluding Appendices Barry Smyth Padraig Cunningham

http://arxiv.org/abs/2511.01675v2 Incorrect Citation Association for Articles in Online-Only Springer Nature Journals 2025-11-10T11:04:42Z

We show that citation metrics of journal articles in many of the online-only Springer Nature journals and associated ones are distorted, going back to articles from 2001. We find that, most likely due to an API response error, there are many incorrect references, which typically point to Article Number 1 of a given Volume. Among others, the issue affects journals such as Scientific Reports, Nature Communications, Communications journals, Cell Death & Disease, Light: Science & Applications, as well as many BMC, Discovery and npj journals. Beyond the negative effect of introducing incorrect reference information, this distorts the citation statistics of articles in these journals, with a few articles being massively over-cited compared to their peers, while many lose citations; e.g., in both Scientific Reports and Nature Communications, 5 of the 10 top-cited articles have an article number of 1. We validate the distorted statistics by assessing data from multiple scientific literature databases: Crossref, OpenCitations, Semantic Scholar, and the journals' websites. The issue primarily arises from the inconsistent transition from page-based referencing of articles to article number-based referencing, as well as the improper handling of the change in the publisher's article metadata API. It seems that the most pressing problem has been present since approximately 2011, and we estimate that it affects the citation counts of millions of authors.

2025-11-03T15:41:04Z Cross-posted to MetaArXiv Tamás Kriváchy

http://arxiv.org/abs/2511.06688v1 Accessibility Gaps in U.S. Government Dashboards for Blind and Low-Vision Residents 2025-11-10T04:20:46Z

Public dashboards are now a common way for US government agencies to share high-stakes information with residents. We audited six live systems at federal, state, and city levels: CDC respiratory illness, HUD homelessness PIT and HIC, California HCD Annual Progress Report, New York City Mayor's Management Report, Houston Permitting, and Chicago public health and budget dashboards. Using a rubric based on screen reader needs and WCAG, we checked five items: (1) discoverability of key metrics by assistive tech, (2) keyboard access without mouse hover, (3) clear semantic labels for axes, series, and categories, (4) short plain-language status and trend notes, and (5) machine-readable tables or CSVs that mirror what sighted users see. Findings are mixed. Many charts fail basic discoverability or depend on hover, which blocks keyboard and screen reader use. Plain-language summaries are common in CDC and Chicago, but rare in HUD and Houston. Machine-readable data is strong for NYC, California, and HUD; it is weaker or unclear for Houston. Several sites promise service for the public or for customers yet do not name accessibility in their descriptions. Across systems we also observe urgency inversion: faster, operational dashboards tend to provide fewer accessible affordances than slower accountability dashboards. These patterns matter for equal participation and for ADA Title II compliance that references WCAG 2.1 AA. We propose three steps for any public dashboard: add a brief status and trend text at the same update cadence, publish a matching table or CSV of the visual metrics, and state an explicit accessibility commitment.

2025-11-10T04:20:46Z Preprint. Accessibility audit of six U.S. public dashboard ecosystems; 1 figure, 2 tables Chadani Acharya

http://arxiv.org/abs/2510.21838v4 A Quantitative Approach to Estimating Bias, Favouritism and Distortion in Scientific Journalism 2025-11-09T13:59:21Z

While traditionally not considered part of the scientific method, science communication is increasingly playing a pivotal role in shaping scientific practice. Researchers are now frequently compelled to publicise their findings in response to institutional impact metrics and competitive grant environments. This shift underscores the growing influence of media narratives on both scientific priorities and public perception. Amid a current trend of personality-driven reporting, we examine patterns in science communication that may indicate biases of different types towards topics and researchers. We applied our methodology to a corpus of media coverage from three of the most prominent scientific media outlets, Wired, Quanta, and The New Scientist, spanning the past 5 to 10 years. By mapping linguistic patterns, citation flows, and topical convergence, our objective was to quantify the dimensions and degree of bias that influence the credibility of scientific journalism. In doing so, we seek to illuminate the systemic features that shape science communication today and to interrogate their broader implications for epistemic integrity and public accountability in science. We present our results with anonymised journalist names but conclude that personality-driven media coverage distorts science and the practice of science, flattening rather than expanding the perception of scientific coverage. Keywords: selective sourcing, bias, scientific journalism, Quanta, Wired, New Scientist, fairness, balance, neutrality, standard practices, distortion, personal promotion, communication, media outlets.

2025-10-22T13:21:45Z Raghavendra Koushik Hector Zenil

http://arxiv.org/abs/2511.07474v1 A national study of postdoctoral research fellows in South Africa 2025-11-08T18:27:59Z

This report provides the first comprehensive analysis of postdoctoral research fellows (postdocs) in South African public universities. It combines an analysis of existing data with the analysis of primary data collected in the form of a survey of institutions on the postdocs they host, a bibliometric study of the research output of postdocs, and an individual survey of postdocs. The number of postdocs has been increasing steadily from 2016 to 2022 and varies across universities, with larger research-intensive universities hosting more postdocs. In terms of demographics, the proportion of black African postdocs has increased; the proportion of female postdocs has remained lower than that of males; there is an increasing proportion of older postdocs; and more than 60 percent of postdocs are foreign-born. The bibliometric analysis of the publication output of postdocs shows that it increased substantially from 2005 to 2022. Some main results of the individual survey are that a postdoc position is taken primarily to enhance prospects for employment in a permanent academic position. However, securing such positions is reported as challenging, which is supported by results that one in every four postdocs has held multiple consecutive postdoc positions, and postdocs in general, but especially non-South Africans, perceive the job market as poor. Postdocs plan to leave South Africa primarily to seek better job opportunities, but also due to immigration rules or visa issues, which constitute major challenges for non-South Africans. Most postdocs desire to contribute to teaching and supervision but often lack the opportunity to do so. Dissatisfaction stems mostly from low levels of remuneration, difficulties created by the precarious nature of their positions and a lack of support for training and career development in their hosting institutions.

2025-11-08T18:27:59Z 119 pages, 43 figures. Commissioned research report submitted by the DSI-NRF Centre of Excellence in Scientometrics and Science, Technology and Innovation Policy to the South African National Research Foundation, February 2024 Heidi Prozesky Francois van Schalkwyk Johann Mouton

http://arxiv.org/abs/2510.09172v2 Generating CodeMeta through declarative mapping rules: An open-ended approach using ShExML 2025-11-07T16:16:03Z

Nowadays, software is one of the cornerstones when conducting research in several scientific fields which employ computer-based methodologies to answer new research questions. However, for these experiments to be completely reproducible, research software should comply with the FAIR principles, yet its metadata can be represented following different data models and spread across different locations. In order to bring some cohesion to the field, CodeMeta was proposed as a vocabulary to represent research software metadata in a unified and standardised manner. While existing tools can help users to generate CodeMeta files for some specific use cases, they fall short on flexibility and adaptability. Hence, in this work, I propose the use of declarative mapping rules to generate CodeMeta files, illustrated through the implementation of three crosswalks in ShExML which are then expanded and merged to cover the generation of CodeMeta files for two existing research software artefacts. Moreover, the outputs are validated using SHACL and ShEx and the whole generation workflow is automated requiring minimal user intervention upon a new version release. This work can, therefore, be used as an example upon which other developers can include a CodeMeta generation workflow in their repositories, facilitating the adoption of CodeMeta and, ultimately, increasing research software FAIRness.

2025-10-10T09:15:08Z Submitted to Scientific Data Herminio García-González

http://arxiv.org/abs/2511.05211v1 Mapping Research Productivity of BRICS Countries with Special Reference to Coronary Artery Disease (CAD): A Scientometric Study 2025-11-07T12:59:36Z

This study presents a comprehensive scientometric analysis of research productivity on Coronary Artery Disease (CAD) among the BRICS countries (Brazil, Russia, India, China, and South Africa), using data retrieved from the Web of Science database for the period 1990 to 2019. A total of 50,036 records were analyzed to assess publication growth trends, authorship patterns, collaboration levels, and citation impact. The findings reveal a steady increase in CAD-related publications, with China emerging as the leading contributor, followed by Brazil, Russia, India, and South Africa. English dominated as the primary language of communication, accounting for over 93% of publications. Authorship and collaboration analysis indicates a high degree of joint research, with 97.91% of studies being co-authored and a degree of collaboration of 0.98, underscoring the collective nature of scientific inquiry in this domain. The study validates the applicability of Lotka's Law for author productivity, Bradford's Law for journal distribution, and Zipf's Law for keyword frequency, while Price's Square Root Law was found inapplicable. The predominant publication format was journal articles (79.7%), and Kardiologiya (Russia) emerged as the most prolific journal. The results demonstrate significant growth in CAD research output and collaboration within BRICS, though notable disparities persist among member nations. The study recommends enhancing individual author productivity, expanding international collaboration, and supporting CAD research through strategic institutional and governmental initiatives. These findings provide valuable insights for policymakers, funding agencies, and the academic community to strengthen cardiovascular research capacity within developing economies.
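The reported degree of collaboration is consistent with the reported share of co-authored studies if the thesis uses the definition common in scientometrics, usually attributed to Subramanyam. That is an assumption here, since the abstract does not name the formula. With $N_m$ multi-authored and $N_s$ single-authored publications out of $N = N_m + N_s$ in total, the check can be sketched as:

```latex
\[
  DC = \frac{N_m}{N_m + N_s},
  \qquad
  DC = \frac{0.9791\,N}{N} = 0.9791 \approx 0.98 .
\]
```

Under this definition the degree of collaboration is simply the co-authored share expressed as a fraction, so the two figures in the abstract restate one measurement rather than two independent ones.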

2025-11-07T12:59:36Z 260 Pages, 21 figures, PhD Thesis 2020 Annamalai University; Published; http://hdl.handle.net/10603/460776; 2022 Muneer Ahmad