Human-in-the-Loop and AI: Crowdsourcing Metadata Vocabulary for Materials Science

2025-12-10T18:22:57Z

Metadata vocabularies are essential for advancing FAIR and FARR data principles, but their development is constrained by limited human resources and inconsistent standardization practices. This paper introduces MatSci-YAMZ, a platform that integrates artificial intelligence (AI) and human-in-the-loop (HILT), including crowdsourcing, to support metadata vocabulary development. The paper reports on a proof-of-concept use case evaluating the AI-HILT model in materials science, a highly interdisciplinary domain. Six (6) participants affiliated with the NSF Institute for Data-Driven Dynamical Design (ID4) engaged with the MatSci-YAMZ platform over several weeks, contributing term definitions and providing examples to prompt refinement of the AI-generated definitions. Nineteen (19) AI-generated definitions were successfully created, with iterative feedback loops demonstrating the feasibility of AI-HILT refinement. Findings confirm the feasibility of the AI-HILT model, highlighting 1) a successful proof of concept, 2) alignment with FAIR and open-science principles, 3) a research protocol to guide future studies, and 4) the potential for scalability across domains. Overall, MatSci-YAMZ's underlying model has the capacity to enhance semantic transparency and reduce the time required for consensus building and metadata vocabulary development.

Knowledge Independence Breeds Disruption but Limits Recognition

2025-12-10T15:22:54Z

Despite extensive research on scientific disruption, two questions remain: why disruption has declined amid growing knowledge, and why disruptive work receives fewer and delayed citations. One way to address these questions is to identify an intrinsic, paper-level property that reliably predicts disruption and explains both patterns. Here, we propose a novel measure, knowledge independence, capturing the extent to which a paper draws on references that do not cite one another. Analyzing 114 million publications, we find that knowledge independence strongly predicts disruption and mediates the disruptive advantage of small, onsite, and fresh teams. Its long-term decline, nonreproducible by null models, provides a mechanistic explanation for the parallel decline in disruption. Causal and simulation evidence further indicates that knowledge independence drives the persistent trade-off between disruption and impact. Taken together, these findings fill a critical gap in understanding scientific innovation, revealing a universal law: Knowledge independence breeds disruption but limits recognition.
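
To make the idea of references that "do not cite one another" concrete, here is a minimal sketch. The abstract does not give the exact formula, so this assumes one plausible operationalization: the fraction of reference pairs with no citation link in either direction. The function name, the data layout, and the toy IDs are all hypothetical.

```python
from itertools import combinations


def knowledge_independence(references, cites):
    """Fraction of reference pairs with no citation link in either direction.

    references: list of IDs of the papers cited by the focal paper.
    cites: dict mapping a paper ID to the set of paper IDs it cites.
    NOTE: this pairwise definition is an assumption, not the paper's formula.
    """
    pairs = list(combinations(references, 2))
    if not pairs:
        return None  # undefined for fewer than two references
    independent = sum(
        1
        for a, b in pairs
        if b not in cites.get(a, set()) and a not in cites.get(b, set())
    )
    return independent / len(pairs)


# Toy citation data with hypothetical IDs: r1 cites r2, and r4 cites r3.
cites = {"r1": {"r2"}, "r2": set(), "r3": set(), "r4": {"r3"}}
score = knowledge_independence(["r1", "r2", "r3", "r4"], cites)
print(score)
```

In this toy case, two of the six reference pairs are linked, so four of six pairs count as independent.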

Investigating the originality of scientific papers across time and domain: A quantitative analysis

2025-12-10T07:57:54Z

The study of creativity in science has long sought quantitative metrics capable of capturing the originality of the scientific insights contained within articles and other scientific works. In recent years, the field has witnessed a substantial expansion of research activity, enabled by advances in natural language processing and network analysis, and has utilised both macro- and micro-scale approaches with success. However, these approaches often do not examine the text itself for evidence of originality. In this paper, we apply a computational measure correlating with originality from creativity science, Divergent Semantic Integration (DSI), to a set of 51,200 scientific abstracts and titles sourced from the Web of Science. To adapt DSI for application to scientific texts, we advance the original BERT method by incorporating SciBERT (a model trained on scientific corpora) into the computation of DSI. In our study, we observe that DSI plays a more pronounced role in the accrual of early citations for papers with fewer authors, varies substantially across subjects and research fields, and exhibits a declining correlation with citation counts over time. Furthermore, by modelling SciBERT- and BERT-DSI as predictors of the logarithm of 5-year citation counts alongside field, publication year, and the logarithm of author count, we find statistically significant relationships, with adjusted R-squared of 0.103 and 0.101 for BERT-DSI and SciBERT-DSI, respectively. Because existing scientometric measures rarely assess the originality expressed in textual content, DSI provides a valuable means of directly quantifying the conceptual originality embedded in scientific writing.
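
As a rough illustration of what a DSI-style score computes, the sketch below takes the mean pairwise cosine distance between word vectors. It assumes the embeddings have already been obtained from a model such as BERT or SciBERT. The tiny two-dimensional vectors are placeholders, and details such as which layers or tokens are used are not given in the abstract and are not modelled here.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dsi(word_vectors):
    """Mean pairwise cosine distance over all pairs of word vectors.

    Requires at least two vectors. In practice the vectors would be
    contextual embeddings; here they are supplied directly (assumption).
    """
    pairs = list(combinations(word_vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Placeholder embeddings: two identical directions and one orthogonal one.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_score = dsi(toy_vectors)
print(toy_score)
```

Higher values indicate that the words in a text are, on average, further apart in embedding space.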

Optimizing Data Extraction from Materials Science Literature: A Study of Tools Using Large Language Models

2025-12-10T07:09:02Z

Large Language Models (LLMs) are increasingly utilized for large-scale extraction and organization of unstructured data owing to their exceptional Natural Language Processing (NLP) capabilities. Vast amounts of data from experiments and simulations, which could empower materials design, are scattered across numerous scientific publications, but high-quality experimental databases are scarce. This study considers the effectiveness and practicality of five representative AI tools (ChemDataExtractor, BERT-PSIE, ChatExtract, LangChain, and Kimi) to extract bandgaps from 200 randomly selected Materials Science publications in two presentations (arXiv and publisher versions), comparing the results to those obtained by human processing. Although the integrity of data extraction has not met expectations, encouraging results have been achieved in terms of precision and the ability to eliminate irrelevant papers from human consideration. Our analysis highlights both the strengths and limitations of these tools, offering insights into improving future data extraction techniques for enhanced scientific discovery and innovation. In conjunction with recent research, we provide guidance on feasible improvements for future data extraction methodologies, helping to bridge the gap between unstructured scientific data and structured, actionable databases.

Sustainable Development Goals in Psychology: A Century of Progress in Publications

2025-12-09T14:14:09Z

The Sustainable Development Goals (SDGs) offer a lens for tracking societal change, yet contributions from the social and behavioral sciences have rarely been integrated into policy agendas. To take stock and create a baseline and benchmark for the future, we assemble 233,061 psychology publications (1894-2022) and tag them to the 17 SDGs using a query-based classifier. Health, education, work, inequality, and gender dominate the study of SDGs in psychology, shifting from an early focus on work to education and inequality, and since the 1960s, health. United States-based research leads across most goals. Other countries set distinct priorities (e.g., China: education and work; Australia: health). Women comprise about one-third of authors, concentrated in social and health goals, but have been underrepresented in STEM-oriented goals. The 2015 launch of the SDGs marked a turning point: SDG-tagged publications have been receiving more citations than comparable non-SDG work, reversing a pre-2015 deficit. Tracking the SDGs through psychology clarifies long-run engagement with social priorities, identifies evidence gaps, and guides priorities to accelerate the field's contribution to the SDG agenda.

Who Are Tweeting About Academic Publications? A Systematic Review and Meta-Analysis of Altmetric Studies

2025-12-06T15:51:20Z

Understanding who shares academic publications on Twitter is critical to interpreting altmetrics as signals of scholarly or societal impact. Prior studies have used diverse and often incompatible user classification schemes, making synthesis difficult. This study presents a systematic review and meta-analysis of 23 empirical studies (covering 79,014 Twitter users, over 20 million tweets, and more than 5 million tweeted publications) to estimate category-specific engagement across three metrics: user counts, tweets, and tweeted publications. We developed a harmonized categorization scheme encompassing 11 user types and applied both Random Effects Models (REM) and Beta-Binomial Hierarchical Models (BBHM) to estimate proportions, account for study-level variation, and model uncertainty. Across all indicators, individual users were the most active, comprising 66% of users, 55% of tweets, and 50% of tweeted publications. BBHM further enabled in-category vs. out-of-category comparisons and revealed engagement differences not detected by REM. T-tests on study-level means confirmed significant differences between academic individuals and other user types. Despite methodological heterogeneity, results consistently show that academic and non-academic individuals dominate Twitter engagement with scholarly content to a statistically equal degree. Our findings support the need for standardized user classification schemes and demonstrate the value of Bayesian modeling for synthesizing altmetric data under study-level variation and sparsity.

Enhancing Information Retrieval in Digital Libraries through Unit Harmonisation in Scholarly Knowledge Graphs

2025-12-06T10:58:17Z

Scientists have always used the studies and research of other researchers to achieve new objectives and perspectives. In particular, reusing the data measured in previous studies is highly practical. Searching the content of other scientists' articles is a challenge that researchers have always struggled with. Nowadays, the use of knowledge graphs as semantic databases has greatly helped in storing and retrieving scholarly knowledge. Such technologies are crucial to upgrading traditional search systems to smart knowledge retrieval, which in turn is crucial to getting the most relevant answers for a user query, especially in information and knowledge management. However, in most cases, only the metadata of a paper is searchable, and it is still cumbersome for scientists to access the content of the papers. In this paper, we present a novel method of faceted search over structured content for comparing and filtering measured data in scholarly knowledge graphs when different units of measurement are used in different studies. This search system proposes applicable units as facets to the user and dynamically integrates content from further remote knowledge graphs to materialize the scholarly knowledge graph and achieve a higher order of exploration usability on scholarly content, which can be filtered to better satisfy the user's information needs. With our faceted search system, users can not only search the contents of scientific articles but also compare and filter heterogeneous data.
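
The core of unit harmonisation can be sketched in a few lines: convert every measured value to a common base unit, then filter on the harmonised values. The conversion table, record layout, and function names below are hypothetical and only cover length units. The system described in the abstract works over knowledge graphs and remote sources, which this sketch does not attempt to model.

```python
# Hypothetical conversion factors from each unit to the base unit (metres).
TO_METRES = {"nm": 1e-9, "um": 1e-6, "mm": 1e-3, "m": 1.0}


def harmonise(value, unit):
    """Convert a length value in the given unit to metres."""
    return value * TO_METRES[unit]


def filter_by_range(records, low, high, unit):
    """Keep records whose harmonised value lies within [low, high] of `unit`."""
    low_m, high_m = harmonise(low, unit), harmonise(high, unit)
    return [
        r for r in records
        if low_m <= harmonise(r["value"], r["unit"]) <= high_m
    ]


# Toy records reporting the same kind of quantity in different units.
records = [
    {"paper": "A", "value": 500, "unit": "nm"},
    {"paper": "B", "value": 0.002, "unit": "mm"},
    {"paper": "C", "value": 3, "unit": "um"},
]
matches = filter_by_range(records, 1, 2.5, "um")
print([r["paper"] for r in matches])
```

Because the comparison happens in the base unit, the user can pick any supported unit as the facet without the records needing to agree on one.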

Expert-Grounded Automatic Prompt Engineering for Extracting Lattice Constants of High-Entropy Alloys from Scientific Publications using Large Language Models

2025-12-05T17:45:32Z

Large language models (LLMs) have shown promise for scientific data extraction from publications, but rely on manual prompt refinement. We present an expert-grounded automatic prompt optimization framework that enhances LLM entity extraction reliability. Using high-entropy alloy lattice constant extraction as a testbed, we optimized prompts for Claude 3.5 Sonnet through feedback cycles on seven expert-annotated publications. Despite a modest optimization budget, recall improved from 0.27 to > 0.9, demonstrating that a small, expert-curated dataset can yield significant improvements. The approach was applied to extract lattice constants from 2,267 publications, yielding data for 1,861 compositions. The optimized prompt transferred effectively to newer models: Claude 4.5 Sonnet, GPT-5, and Gemini 2.5 Flash. Analysis revealed three categories of LLM mistakes: contextual hallucination, semantic misinterpretation, and unit conversion errors, emphasizing the need for validation protocols. These results establish feedback-guided prompt optimization as a low-cost, transferable methodology for reliable scientific data extraction, providing a scalable pathway for complex LLM-assisted research tasks.
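
For readers unfamiliar with how a recall figure like the one above is obtained, the following sketch scores a set of extracted (composition, value) pairs against an expert-annotated gold set. The exact-match criterion, the placeholder alloy names, and the numeric values are assumptions for illustration. The abstract does not describe the matching rules actually used.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted items against a gold set (exact match)."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations: (composition label, lattice constant in angstroms).
gold = {
    ("alloy_A", 3.57),
    ("alloy_B", 2.88),
    ("alloy_C", 3.60),
    ("alloy_D", 3.21),
}
extracted = {("alloy_A", 3.57), ("alloy_C", 3.60), ("alloy_E", 4.05)}

precision, recall = precision_recall(extracted, gold)
print(precision, recall)
```

Recall penalises gold entries the extractor missed, while precision penalises extracted entries that are not in the gold set.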

Measuring the Potential of Scientific Literature: A Network-Based Approach to Identifying Paradigm-Shifting Research

2025-12-05T13:49:45Z

This study introduces the Disruption Index as a superior citation-based metric. This index quantitatively assesses the degree to which a publication redirects subsequent scholarly attention away from its preceding literature, thus measuring its novelty and disruptive impact. We tested the D metric's efficacy using a rigorous dataset comprising seminal publications by Nobel Prize winners across Physics, Chemistry, and Physiology or Medicine, benchmarked against control papers with comparable citation counts but non-transformative influence. Our analysis conclusively demonstrates that the D metric effectively distinguishes these prize-worthy, field-redefining works from highly cited but merely incremental research. Furthermore, we explore two contextual variables associated with high disruptive potential: (i) the scale of collaboration (author team size) and (ii) the linguistic structure of the article's title and summary text. The results reveal a strong positive correlation between larger collaborative teams and elevated average D scores, suggesting that extensive collaboration may be a facilitator for generating paradigm shifts. Additionally, publications with high D values tend to feature more expansive titles and greater density of specialized, technical jargon in their abstracts. These findings validate the D metric as a reliable and scalable instrument for both historical and predictive identification of transformative research. They also furnish empirical evidence concerning the team structures and communication patterns that optimize for the production of groundbreaking scientific knowledge.
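
The abstract does not state the formula for D. The sketch below assumes the commonly used formulation D = (n_i - n_j) / (n_i + n_j + n_k), where n_i counts later papers citing only the focal paper, n_j counts those citing both the focal paper and at least one of its references, and n_k counts those citing only the references. All paper IDs are hypothetical.

```python
def disruption_index(focal, references, later_papers):
    """Disruption index under the assumed (n_i - n_j) / (n_i + n_j + n_k) form.

    focal: ID of the focal paper.
    references: IDs of the papers the focal paper cites.
    later_papers: dict mapping each later paper ID to the set of IDs it cites.
    """
    refs = set(references)
    n_i = n_j = n_k = 0
    for cited in later_papers.values():
        cites_focal = focal in cited
        cites_refs = bool(refs & cited)
        if cites_focal and not cites_refs:
            n_i += 1  # builds on the focal paper alone
        elif cites_focal and cites_refs:
            n_j += 1  # builds on the focal paper and its predecessors
        elif cites_refs:
            n_k += 1  # bypasses the focal paper
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else None


# Toy citation data: p5 cites neither the focal paper nor its references.
later_papers = {
    "p1": {"F"},
    "p2": {"F"},
    "p3": {"F", "R1"},
    "p4": {"R2"},
    "p5": {"X"},
}
d_score = disruption_index("F", ["R1", "R2"], later_papers)
print(d_score)
```

Under this form, D ranges from -1 (purely consolidating) to 1 (purely disruptive).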

Big Tech-Funded AI Papers Have Higher Citation Impact, Greater Insularity, and Larger Recency Bias

2025-12-05T13:41:29Z

Over the past four decades, artificial intelligence (AI) research has flourished at the nexus of academia and industry. However, Big Tech companies have increasingly acquired the edge in computational resources, big data, and talent. So far, it has been largely unclear how many papers the industry funds, how their citation impact compares to non-funded papers, and what drives industry interest. This study fills that gap by quantifying the number of industry-funded papers at 10 top AI conferences (e.g., ICLR, CVPR, AAAI, ACL) and their citation influence. We analyze about 49.8K papers, about 1.8M citations from AI papers to other papers, and about 2.3M citations from other papers to AI papers from 1998-2022 in Scopus. Through seven research questions, we examine the volume and evolution of industry funding in AI research, the citation impact of funded papers, the diversity and temporal range of their citations, and the subfields in which industry predominantly acts. Our findings reveal that industry presence has grown markedly since 2015, from less than 2 percent to more than 11 percent in 2020. Between 2018 and 2022, 12 percent of industry-funded papers achieved high citation rates as measured by the h5-index, compared to 4 percent of non-industry-funded papers and 2 percent of non-funded papers. Top AI conferences engage more with industry-funded research than non-funded research, as measured by our newly proposed metric, the Citation Preference Ratio (CPR). We show that industry-funded research is increasingly insular, citing predominantly other industry-funded papers while referencing fewer non-funded papers. These findings reveal new trends in AI research funding, including a shift towards more industry-funded papers and their growing citation impact, greater insularity of industry-funded work than non-funded work, and a preference of industry-funded research to cite recent work.

The Reproducible Research Platform establishes a unified open science environment bridging data and software lifecycles across disciplines, from proposal to publication

2025-12-04T22:02:19Z

Many research groups aspire to make data and code FAIR and reproducible, yet struggle because the data and code life cycles are disconnected, executable environments are often missing from published work, and technical skill requirements hinder adoption. Existing approaches rarely enable researchers to keep using their preferred tools or support seamless execution across domains. To close this gap, we developed the open-source Reproducible Research Platform (RRP), which unifies research data management with version-controlled, containerized computational environments in modular, shareable projects. RRP enables anyone to execute, reuse, and publish fully documented, FAIR research workflows without manual retrieval or platform-specific setup. We demonstrate RRP's impact by reproducing results from diverse published studies, including work over a decade old, showing sustained reproducibility and usability. With a minimal graphical interface focused on core tasks, modular tool installation, and compatibility with institutional servers or local computers, RRP makes reproducible science broadly accessible across scientific domains.

Can ChatGPT evaluate research environments? Evidence from REF2021

2025-12-04T19:13:07Z

UK academic departments are evaluated partly on the statements that they write about the value of their research environments for the Research Excellence Framework (REF) periodic assessments. These statements mix qualitative narratives and quantitative data, typically requiring time-consuming and difficult expert judgements to assess. This article investigates whether Large Language Models (LLMs) can support the process or validate the results, using the UK REF2021 unit-level environment statements as a test case. Based on prompts mimicking the REF guidelines, ChatGPT 4o-mini scores correlated positively with expert scores in almost all 34 (field-based) Units of Assessment (UoAs). ChatGPT's scores had moderate to strong positive Spearman correlations with REF expert scores in 32 out of 34 UoAs: 14 UoAs above 0.7 and a further 13 between 0.6 and 0.7. Only two UoAs had weak or no significant associations (Classics and Clinical Medicine). From further tests for UoA34, multiple LLMs had significant positive correlations with REF2021 environment scores (all p < .001), with ChatGPT 5 performing best (r=0.81; $ρ$=0.82), followed by ChatGPT 4o-mini (r=0.68; $ρ$=0.67) and Gemini Flash 2.5 (r=0.67; $ρ$=0.69). If LLM-generated scores for environment statements are used in future to help reduce workload, support more consistent interpretation, and complement human review, then caution must be exercised because of the potential for biases, inaccuracy in some cases, and unwanted systemic effects. Even the strong correlations found here seem unlikely to be judged close enough to expert scores to fully delegate the assessment task to LLMs.
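
The abstract reports both Pearson (r) and Spearman (ρ) correlations. As a reminder of how they differ, this sketch computes both in plain Python, treating Spearman as the Pearson correlation of average ranks. The score lists are invented toy data, not REF or LLM scores.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy scores: perfectly monotonic but not linear.
expert_scores = [1, 2, 3, 4, 5]
model_scores = [1, 4, 9, 16, 25]
r_value = pearson(expert_scores, model_scores)
rho_value = spearman(expert_scores, model_scores)
print(r_value, rho_value)
```

Spearman only depends on the ordering of the scores, so it is the natural choice when the question is whether the model ranks statements the same way the experts do.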

Introducing multiverse analysis to bibliometrics: The case of team size effects on disruptive research

2025-12-04T13:52:42Z

Although bibliometrics has become an essential tool in the evaluation of research performance, bibliometric analyses are sensitive to a range of methodological choices. Subtle choices in data selection, indicator construction, and modeling decisions can substantially alter results. Ensuring robustness (meaning that findings hold up under different reasonable scenarios) is therefore critical for credible research and research evaluation. To address this issue, this study introduces multiverse analysis to bibliometrics. Multiverse analysis is a statistical tool that enables analysts to transparently discuss modeling assumptions and thoroughly assess model robustness. Whereas standard robustness checks usually cover only a small subset of all plausible models, multiverse analysis includes all plausible models. The benefits of multiverse analysis are illustrated by assessing the robustness of the findings reported by Wu et al. (2019), who observed that small teams tend to produce more disruptive research than large teams. While we found robust evidence of a negative effect of team size on disruption scores, the effect size depends substantially on the model specification. Our findings underscore the importance of assessing the multiverse robustness of bibliometric results to clarify their practical implications.
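
The mechanics of a multiverse analysis can be sketched as follows: enumerate every combination of analytic choices, fit the same model under each, and inspect the distribution of estimates. The three choices below (team-size transform, minimum citation threshold, maximum team size), the toy data, and the simple one-predictor regression are all assumptions for illustration and are not the specifications used in the study.

```python
import math
from itertools import product


def ols_slope(xs, ys):
    """Slope of a one-predictor least-squares regression of ys on xs."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy papers: (team size, disruption score, citation count). Invented values.
papers = [
    (1, 0.30, 3), (2, 0.20, 10), (3, 0.15, 8), (5, 0.05, 20),
    (8, 0.00, 2), (12, -0.05, 30), (20, -0.10, 15),
]

# Hypothetical analytic choices; each combination is one "universe".
choices = {
    "transform": ["raw", "log"],
    "min_citations": [0, 5],
    "max_team": [None, 10],
}

results = []
for transform, min_cit, max_team in product(*choices.values()):
    subset = [
        (team, d) for team, d, cit in papers
        if cit >= min_cit and (max_team is None or team <= max_team)
    ]
    xs = [math.log(team) if transform == "log" else team for team, _ in subset]
    ys = [d for _, d in subset]
    spec = (transform, min_cit, max_team)
    results.append((spec, ols_slope(xs, ys)))

for spec, slope in results:
    print(spec, round(slope, 4))
```

In this toy setup every specification yields a negative slope, while the magnitude varies across specifications, which is the kind of pattern a multiverse analysis is designed to expose.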

Aging and the Narrowing of Scientific Innovation

2025-12-04T07:55:25Z

With rising life expectancies around the world and an older scientific workforce than ever before, what does aging mean for individual scientists, and what do aging scientists mean for scientific progress as a whole? Here we examine how scientists and scholars age in terms of how their ideas and contributions relate to the evolving frontier of knowledge and how demographically aging fields relate to field-level advance. At the individual level, we examine how research experiences and choices can moderate the effects of intellectual aging. At the collective level, we explore mechanisms that link individual and collective aging. Prior research focuses on star scientists, their changing dates and rates of breakthrough success throughout history. We explore this for scientists in all fields over time, drawing upon novel deep learning measurements that allow us not only to trace positive attention through citation but also negative attention through explicit criticism with a novel, comprehensive database of over 20,000 human-validated critical citations. We find that younger scientists tend toward disruptive contributions that push the frontier, while older scientists engage in combinatorial innovation with an aging collection of components. This includes analyzing the impact of the 1994 U.S. Supreme Court ruling on mandatory retirement and examining how unexpected collaborations affect citation patterns.

"All You Need" is Not All You Need for a Paper Title: On the Origins of a Scientific Meme

2025-12-03T17:36:45Z

The 2017 paper "Attention Is All You Need" introduced the Transformer architecture and inadvertently spawned one of machine learning's most persistent naming conventions. We analyze 717 arXiv preprints containing "All You Need" in their titles (2009-2025), finding exponential growth ($R^2$ > 0.994) following the original paper, with 200 titles in 2025 alone. Among papers following the canonical "X [Is] All You Need" structure, "Attention" remains the most frequently claimed necessity (28 occurrences). Situating this phenomenon within memetic theory, we argue the pattern's success reflects competitive pressures in scientific communication that increasingly favor memorability over precision. Whether this trend represents harmless academic whimsy or symptomatic sensationalism, we leave, with appropriate self-awareness, to the reader.