Robust Evidence for Declining Disruptiveness: Assessing the Role of Zero-Backward-Citation Works

2026-03-19T23:54:51Z

We respond to Holst et al.'s critique that the decline in scientific disruptiveness documented in Park et al. (Nature, 2023) is an artifact of including works with zero backward citations. Using their advocated dataset, metric, and exclusion criteria, we find declines equivalent to major benchmark transformations in science. Their own regression model--designed to address their concerns about zero-citation works--yields large and significant declines for both papers and patents (p<0.001), a result found in their supplementary tables yet left unaddressed, despite directly contradicting their central claim. Their critique is further undermined by severe quality issues in their data, which contain three times more zero-citation works than ours. We trace this excess to their inclusion of at least 2.8 million editorials, obituaries, and comments, 1.5 million books and proceedings, and 254,000 product and artistic reviews--in all, 20% of their sample is non-research content that almost by definition lacks backward citations. Simple keyword searches confirm the problem's severity, identifying among others 456 For Dummies guides, 50 Dr. Seuss and Curious George books, and the Captain Underpants series--all zero-citation entries in their sample. Applying granular document type classification to their data reveals that such non-research content fell from 40% to 8% of their sample between 1945 and 2010--a shift sufficient to generate the decline in zero-citation prevalence they attribute to metadata errors in our study. Standard practice excludes such content to guard against the metadata quality concerns at the center of their critique--concerns their dataset exemplifies rather than addresses. Declining disruptiveness has been documented in nearly 100 studies across multiple databases, metrics, and non-citation-based measures. The weight of evidence does not support an artifact-based explanation.
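
To see why zero-backward-citation works matter for this debate, it helps to look at how a citation-based disruptiveness score is typically computed. The sketch below assumes the commonly used CD index, (n_i - n_j) / (n_i + n_j + n_k), which the abstract itself does not spell out; the function name and the toy citation data are hypothetical. Under that definition, a work with no references can never be cited together with its references, so any such work that is cited at all scores exactly 1.

```python
def cd_index(focal_id, focal_refs, later_works):
    """CD index sketch (assumed definition, not taken from the abstract).

    focal_refs:  set of ids cited by the focal work (its backward citations)
    later_works: dict mapping each later work's id to the set of ids it cites
    """
    n_i = n_j = n_k = 0
    for cited in later_works.values():
        cites_focal = focal_id in cited
        cites_refs = bool(cited & focal_refs)
        if cites_focal and not cites_refs:
            n_i += 1  # cites the focal work only
        elif cites_focal and cites_refs:
            n_j += 1  # cites the focal work and its references
        elif cites_refs:
            n_k += 1  # cites only the focal work's references
    total = n_i + n_j + n_k
    return None if total == 0 else (n_i - n_j) / total


# Toy data: four later works and what each one cites.
later = {"A": {"F"}, "B": {"F", "R1"}, "C": {"R1"}, "D": {"X"}}

with_refs = cd_index("F", {"R1", "R2"}, later)  # n_i = n_j = n_k = 1
without_refs = cd_index("F", set(), later)      # n_j and n_k are forced to 0
print(with_refs, without_refs)
```

With the same forward citations, removing the focal work's reference list moves the score from 0.0 to 1.0, which is why the share of zero-backward-citation records in a sample can shift average disruptiveness.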

Review and Analysis of Scientific Paper Embellishments

2026-03-19T14:49:30Z

We present a review and analysis of scientific paper embellishments -- simple visual elements that are deeply integrated into the text of scientific publications. These embellishments are increasingly used in research papers. They have the potential to enhance textual descriptions, strengthen connections between figures and content, and improve internal textual coherence, but they also carry the risk of disrupting the reading experience. As their exact impact is not yet well understood, we conducted a systematic review of all visualization papers published between 2019 and 2024 in IEEE VIS, ACM CHI, and EuroVis. From this corpus, we identified 374 papers that used paper embellishments and distilled three key dimensions that characterize their usage: purposes (WHY), design choices (HOW), and locations (WHERE). Our findings provide a structured perspective on the form of current embellishments in scientific writing in the visualization domain and offer insights into their role in shaping scientific communication.

A Framework and Prototype for a Navigable Map of Datasets in Engineering Design and Systems Engineering

2026-03-18T15:32:25Z

The proliferation of data across the system lifecycle presents both a significant opportunity and a challenge for Engineering Design and Systems Engineering (EDSE). While this "digital thread" has the potential to drive innovation, the fragmented and inaccessible nature of existing datasets hinders method validation, limits reproducibility, and slows research progress. Unlike fields such as computer vision and natural language processing, which benefit from established benchmark ecosystems, engineering design research often relies on small, proprietary, or ad-hoc datasets. This paper addresses this challenge by proposing a systematic framework for a "Map of Datasets in EDSE." The framework is built upon a multi-dimensional taxonomy designed to classify engineering datasets by domain, lifecycle stage, data type, and format, enabling faceted discovery. An architecture for an interactive discovery tool is detailed and demonstrated through a working prototype, employing a knowledge graph data model to capture rich semantic relationships between datasets, tools, and publications. An analysis of the current data landscape reveals underrepresented areas ("data deserts") in early-stage design and system architecture, as well as relatively well-represented areas ("data oases") in predictive maintenance and autonomous systems. The paper identifies key challenges in curation and sustainability and proposes mitigation strategies, laying the groundwork for a dynamic, community-driven resource to accelerate data-centric engineering research.
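
The faceted discovery described above can be illustrated with a small sketch. It assumes a flat record with the four taxonomy dimensions named in the abstract (domain, lifecycle stage, data type, format). The class name, field names, and catalog entries are hypothetical placeholders, and the knowledge graph model of the actual prototype is not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    # Hypothetical record; fields mirror the four taxonomy dimensions.
    name: str
    domain: str
    lifecycle_stage: str
    data_type: str
    fmt: str


def facet_filter(records, **facets):
    """Keep records whose fields match every requested facet value."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in facets.items())]


def stage_counts(records, stages):
    """Count records per lifecycle stage; a zero hints at a 'data desert'."""
    counts = Counter(r.lifecycle_stage for r in records)
    return {stage: counts.get(stage, 0) for stage in stages}


# Placeholder entries for illustration only, not real datasets.
catalog = [
    DatasetRecord("example-bearing-vibration", "predictive maintenance",
                  "operation", "time series", "csv"),
    DatasetRecord("example-driving-scenes", "autonomous systems",
                  "operation", "image", "png"),
    DatasetRecord("example-concept-sketches", "product design",
                  "early-stage design", "image", "png"),
]

hits = facet_filter(catalog, lifecycle_stage="operation", data_type="image")
coverage = stage_counts(
    catalog, ["early-stage design", "system architecture", "operation"])
print([r.name for r in hits], coverage)
```

Counting records per facet value is one simple way to surface sparsely covered areas. A graph model would additionally link datasets to tools and publications.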

Benchmarking Cross-Scale Perception Ability of Large Multimodal Models in Material Science

2026-03-18T07:51:20Z

Unraveling the hierarchical structure-property relationships is the central challenge of materials science, necessitating the interpretation of data across vast physical scales from micro to macro. Despite the rapid integration of Large Multimodal Models (LMMs) into scientific workflows, existing scientific benchmarks primarily focus on general chart interpretation or isolated common-sense reasoning, failing to capture reasoning ability across intricate physical dimensions. To address this, we introduce CSMBench, a dataset comprising 1,041 high-quality figures curated from premier journals up to September 2025. CSMBench categorizes data into four scientifically distinct regimes: atomic, micro, meso, and macro scales, strictly aligning with the focus and definitions in materials study. Through open-ended figure description and multiple-choice caption matching tasks, we evaluate state-of-the-art open-source and closed-source models. Our analysis shows that performance varies significantly across physical scales due to their distinct visual characteristics, highlighting the limitations of current generalist models and identifying critical directions for achieving hierarchical and accurate understanding in materials research. CSMBench is publicly released at: https://huggingface.co/datasets/lututu/CSMBench.

citecheck: An MCP Server for Automated Bibliographic Verification and Repair in Scholarly Manuscripts

2026-03-18T04:10:31Z

Reference lists in scholarly manuscripts frequently contain errors, including incorrect identifiers, incomplete metadata, misattributed authors, and mismatches between preprint and published versions. These problems are tedious to repair manually and have become more visible in workflows that rely on large language models, which can fabricate or corrupt citations. We present citecheck, a TypeScript system and MCP server for automated bibliographic verification and repair in paper-like project folders. Given a manuscript file or workspace, citecheck selects the most likely paper artifact, extracts references from .bib, .tex, .md, .txt, or .docx, validates entries against PubMed, Crossref, arXiv, and Semantic Scholar, and returns structured correction proposals together with replacement-safety diagnostics. The current repository provides a working research prototype with multi-pass retrieval, manifestation-aware matching, policy-gated rewrite planning, and 47 passing tests covering repair behavior, malformed payload handling, transport failures, and MCP exposure. We position citecheck as infrastructure for agentic scholarly editing and as a practical guardrail against both traditional reference errors and LLM-induced citation hallucinations.

APCs and citation impact of Gold OA articles authored by Ukrainian scholars before and during Russia's full-scale war against Ukraine (2020-2023)

2026-03-17T17:41:27Z

This study first examines how APC expenditures, authorship patterns, and publishing venues of Ukrainian scholars changed between the pre-war (2020-2021) and wartime (2022-2023) periods. Second, it explores the extent to which APC levels are associated with the field-normalized citation impact (FNCI) of Gold Open Access articles authored by Ukrainian scholars. Statistical analysis revealed a small but significant correlation between APC amounts and citation impact, though the effect size was minimal, suggesting higher APCs did not substantially boost citations. APC waivers offered by major publishers such as Springer and Elsevier since 2022 resulted in only a slight increase in the number of articles authored solely by Ukrainian scholars. Despite these waivers, MDPI and Aluna maintained the largest shares. Between 2020 and 2023, the number of articles authored solely by Ukrainian scholars in foreign journals fell by 25.7 percent, and total APC spending declined by 24.6 percent, from 1.24 million EUR to 0.93 million EUR. Medicine accounted for the largest share of both articles and APC expenditure, with the majority published in Aluna journals.
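
As a quick check of the spending figures, the relative change can be recomputed from the rounded totals quoted above. This is only a sketch: the helper name is hypothetical, and the inputs are the rounded values from the abstract rather than the underlying data.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Rounded totals from the abstract, in million EUR.
apc_change = percent_change(1.24, 0.93)
print(round(apc_change, 1))  # -25.0
```

The rounded totals give a decline of about 25 percent, close to the reported 24.6 percent. The small gap is presumably because the reported figure was computed from unrounded totals, which is an assumption here.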

Organisational accounts engaged in scholarly communication on Twitter: Patterns of presence, activity and engagement

2026-03-17T15:09:03Z

Organisational accounts are an integral part of the Twitter (now X) ecosystem. This study identified 9,842 research- and policy-related organisational accounts that had tweeted about scholarly publications by linking three global organisational databases (GRID, ROR, and Overton) with two altmetric databases containing Twitter data (Altmetric and the former Crossref Event Data). The resulting openly available dataset was used to examine organisational activity in scholarly communication across three dimensions: social media capital, tweeting activity, and engagement level. The results show that, compared to all Twitter users engaged in scholarly communication, organisational accounts hold a notable advantage in terms of follower bases and the proportion of scholarly tweets. Their scholarly tweets achieve high visibility through likes and retweets but perform weakly in generating more conversational forms of engagement, such as quotes and replies. Distinct patterns emerge across organisational categories: research facilities, in particular, demonstrate the strongest focus on scholarly tweeting, whereas government accounts are comparatively more successful in eliciting engagement across all metrics, including the more interactive ones. This study contributes both an open dataset of organisational accounts and a methodological framework for their identification, while also highlighting the important roles that organisations play in shaping scholarly discourse on social media.

A Formalization of the Ionescu-Tulcea Theorem in Mathlib

2026-03-17T10:11:02Z

We describe the formalization of the Ionescu-Tulcea theorem, showing the existence of a probability measure on the space of trajectories of a Markov chain, in the proof assistant Lean using the integrated library Mathlib. We first present a mathematical proof before describing the difficulties that arise when trying to formalize it and how they were overcome. We then build on this work to formalize the construction of the product of an arbitrary family of probability measures.

Limitations of Public Chest Radiography Datasets for Artificial Intelligence: Label Quality, Domain Shift, Bias and Evaluation Challenges

2026-03-16T17:35:20Z

Artificial intelligence has shown significant promise in chest radiography, where deep learning models can approach radiologist-level diagnostic performance. Progress has been accelerated by large public datasets such as MIMIC-CXR, ChestX-ray14, PadChest, and CheXpert, which provide hundreds of thousands of labelled images with pathology annotations. However, these datasets also present important limitations. Automated label extraction from radiology reports introduces errors, particularly in handling uncertainty and negation, and radiologist review frequently disagrees with assigned labels. In addition, domain shift and population bias restrict model generalisability, while evaluation practices often overlook clinically meaningful measures. We conduct a systematic analysis of these challenges, focusing on label quality, dataset bias, and domain shift. Our cross-dataset domain shift evaluation across multiple model architectures revealed substantial external performance degradation, with pronounced reductions in AUPRC and F1 scores relative to internal testing. To assess dataset bias, we trained a source-classification model that distinguished datasets with near-perfect accuracy, and performed subgroup analyses showing reduced performance for minority age and sex groups. Finally, expert review by two board-certified radiologists identified significant disagreement with public dataset labels. Our findings highlight important clinical weaknesses of current benchmarks and emphasise the need for clinician-validated datasets and fairer evaluation frameworks.

Have Large Language Models Enhanced the Way Civil & Environmental Engineers Write? A Quantitative Analysis of Scholarly Communication over 25 Years

2026-03-16T15:16:52Z

Large language models (LLMs) have rapidly emerged in civil and environmental engineering (CEE) research, education, and practice as tools for project ideation, execution, and communication. However, it is unknown how prevalent LLM adoption is across CEE scholarship and whether it measurably alters research prose. Inspired by recent analyses of biomedical research, this study uses a vocabulary-based frequency-shift methodology to detect linguistic signals of LLM-assisted writing in a large corpus of CEE literature. A total of 149,452 abstracts published by the American Society of Civil Engineers from 2000 through 2025 are analyzed to quantify deviations from long-term vocabulary trends. Prior to the introduction of LLMs in 2022, CEE publications exhibit long-term trends toward longer abstracts and sentences, greater use of segmenting punctuation, higher required reading levels, and a shift toward active, first-person verb constructions. Beginning around 2023, however, the frequencies of many stylistic marker words (e.g., enhance) sharply depart from historical trajectories, accompanied by deviations in multiple semantic properties. Abstracts classified as likely LLM-assisted exhibit increased lexical diversity, comma use, and complexity, with reduced passive voice and hedging language, producing prose that is more segmented, complex, and confident. The AI contribution of this study lies in the use of natural language processing to identify population-level linguistic signals of LLM-assisted text, applied to quantify the prevalence of LLM use and its influence on the vocabulary, structure, and tone of engineering scholarly writing. Together, these findings provide the first large-scale, data-driven assessment of how LLMs are beginning to reshape scholarly communication in CEE.
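
The general idea of a vocabulary-based frequency-shift analysis can be sketched as follows: fit a trend to a word's yearly frequency before LLMs were available, extrapolate it, and compare the extrapolation with what was observed later. The sketch assumes a linear trend, a baseline ending in 2021, and synthetic frequencies. None of these details are stated in the abstract, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def frequency_shift(freq_by_year, baseline_end, target_year):
    """Return (excess, ratio) of the observed frequency over the extrapolated trend."""
    baseline = sorted((y, f) for y, f in freq_by_year.items() if y <= baseline_end)
    slope, intercept = fit_line([y for y, _ in baseline], [f for _, f in baseline])
    expected = slope * target_year + intercept
    observed = freq_by_year[target_year]
    ratio = observed / expected if expected > 0 else None
    return observed - expected, ratio


# Synthetic share of abstracts containing one marker word (illustrative only).
marker_freq = {2017: 0.010, 2018: 0.011, 2019: 0.012,
               2020: 0.013, 2021: 0.014, 2024: 0.040}

excess, ratio = frequency_shift(marker_freq, baseline_end=2021, target_year=2024)
print(excess, ratio)
```

In this synthetic series the trend predicts a frequency of about 0.017 for 2024, so the observed 0.040 is an excess of roughly 0.023, more than twice the expected value.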

Exploring Novelty Differences between Industry and Academia: A Knowledge Entity-centric Perspective

2026-03-16T14:41:51Z

Academia and industry each possess distinct advantages in advancing technological progress. Academia's core mission is to promote open dissemination of research results and drive disciplinary progress. Industry values knowledge appropriability and core competitiveness, yet actively engages in open practices such as academic conferences and platform sharing, creating a knowledge strategy paradox. Highly novel and publicly accessible knowledge serves as the driving force behind technological advancement. However, it remains unclear whether industry or academia produces more novel research outcomes. Some studies argue that academia tends to generate more novel ideas, while others suggest that industry researchers are more likely to drive breakthroughs. Previous studies have been limited by data sources and inconsistent measures of novelty. To address these gaps, this study analyzes four types of fine-grained knowledge entities (Method, Tool, Dataset, Metric), calculates semantic distances between entities within a unified semantic space to quantify novelty, and thereby makes novelty comparable across different types of literature. A regression model is then constructed to analyze the differences in publication novelty between industry and academia. The results indicate that academia produces more novel outputs, a difference that is particularly evident in patents. At the entity level, both academia and industry emphasize method-driven advancements in papers, while industry holds a unique advantage in datasets. Additionally, academia-industry collaboration has a limited effect on enhancing the novelty of research papers, but it helps to enhance the novelty of patents. We release our data and associated code at https://github.com/tinierZhao/entity_novelty.
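
One simple way to turn entity embeddings into a novelty score is to average the pairwise cosine distances between the entities mentioned in a document. The sketch below assumes that aggregation and uses tiny hand-written vectors. The abstract does not specify the distance function or how distances are combined, so both choices and all names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Average cosine distance over all entity pairs (assumed novelty proxy)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-d embeddings for three entities in one document (illustrative only).
entity_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
novelty = mean_pairwise_distance(entity_vectors)
print(novelty)
```

Because all entity types share one embedding space in this setup, the same score can be computed for papers and patents alike, which is the kind of comparability the abstract refers to.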

Which stylistic features fool ChatGPT research evaluations?

2026-03-16T07:25:46Z

Large Language Models (LLMs) have the potential to be used to support research evaluation and have a moderate capability to estimate the research quality of a journal article from its title and abstract. This paper assesses whether there are language-related factors unrelated to the quality of the research that influence ChatGPT's scores. Using a dataset of 99,277 journal articles submitted to the UK-wide Research Excellence Framework (REF) 2021 assessments, we calculated several readability indicators from abstracts and correlated them with ChatGPT scores and departmental REF scores. The results show that linguistic complexity and length were more strongly associated with ChatGPT research quality scores than with REF expert scores in many subject areas. Although cause-and-effect was not tested, these results suggest that ChatGPT may be more likely than human experts to reward linguistic complexity, with a potential bias towards longer and less readable abstracts in many fields. The apparent preference of LLMs for complex language is an undesirable feature for practical applications of LLMs for research quality evaluation, unless solutions can be found.

Can Large Language Models Evaluate Grant Proposal Quality? Revisiting the Wennerås and Wold Peer Review Data

2026-03-15T19:34:54Z

Purpose: Despite the importance of peer review for grant funding decisions, academics are often reluctant to conduct it. This can lead to long delays between submission and the final decision as well as the risk of substandard reviews from busy or non-specialist scholars. At least one funder now uses Large Language Models (LLMs) to reduce the reviewing burden, but the accuracy of LLMs for scoring grant proposals needs to be assessed. Design/methodology/approach: This article compares scores from a range of medium-sized open-weights LLMs with peer review scores for a well-researched dataset, the Swedish Medical Council's post-doctoral fellowship applications from 1994. Findings: Whilst the LLM scores correlated moderately with each other (mean Spearman correlation: 0.34), they correlated weakly but positively, and mostly statistically significantly, with the average expert scores (mean Spearman correlation: 0.22). The highest rank correlation between expert scores and LLMs was 0.33 for Gemma 3 27b based on proposal titles and summaries without their main texts, which is about half (56%) of the correlation between reviewers. Research limitations: The small sample size, old funding call and heterogeneous evaluation criteria all undermine the robustness of the analysis. Practical implications: Although LLMs are quantitatively weaker than experts at scoring grant proposals, at least in this special case, they may have a role in application triage or tie-breaking. Originality/value: This is the first assessment of the value of LLM scores for funding proposals.
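
The Spearman correlations reported above are Pearson correlations computed on ranks. The following sketch shows that computation on made-up scores and also derives the inter-reviewer correlation implied by the quoted 0.33 and 56% figures. The function names and sample data are hypothetical, and the implied value is an inference from the rounded numbers, not a figure given in the abstract.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up expert and LLM scores for five proposals (illustrative only).
expert_scores = [1, 2, 3, 4, 5]
llm_scores = [2, 1, 4, 3, 5]
rho = spearman(expert_scores, llm_scores)

# Inter-reviewer correlation implied by "0.33 is 56% of it" (rounded inputs).
implied_reviewer_rho = 0.33 / 0.56
print(rho, round(implied_reviewer_rho, 2))
```

On the rounded figures, the implied correlation between reviewers is roughly 0.59, which gives a sense of the ceiling against which the LLM results are being compared.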

Researcher Population Pyramids: Tracking Demographic and Gender Trajectories Across Countries

2026-03-15T12:49:02Z

The sustainability of the academic ecosystem relies on researcher demographics and gender balance, yet assessing these dynamics in a timely manner for policy is challenging. Here, we propose a researcher population pyramid framework for tracking demographic and gender trajectories across countries using publication data. We provide a timely snapshot of historical and present demographics and gender balance across 58 countries, revealing three contrasting patterns among research systems: Emerging systems (e.g., Arab countries) exhibit high researcher inflows with widening gender gaps in cumulative productivity; Mature systems (e.g., the United States) show modest inflows with narrowing gender gaps; and Rigid systems (e.g., Japan) lag in both. Furthermore, by simulating future scenarios, the framework makes potential trajectories visible. If 2023 demographic patterns persist, Arab countries' systems could resemble mature or even rigid ones by 2050. Our framework provides a robust diagnostic tool for policymakers worldwide to foster sustainable talent pipelines and gender equality in academia.

Rising Prevalence of Detected AI-Generated Text in Medical Literature: Longitudinal Analysis in Open Access Articles

2026-03-15T02:49:04Z

Generative artificial intelligence (AI) tools are increasingly used for writing tasks. However, the extent of their use in peer-reviewed medical literature remains unclear. We conducted a longitudinal analysis of all Original Investigations, Research Letters, and Invited Commentaries published in JAMA Network Open from January 2022 through March 2025. The main body text of 7,251 articles was analyzed using a commercial AI-detection tool (Originality.AI) to estimate the probability that manuscripts contained a significant amount of AI-generated content. Results were aggregated by month, publication type, and domain. Overall, 195 articles (2.7%) were classified as containing significant AI-generated text. The monthly proportion increased from 0.0% in January 2022 to 11.3% in March 2025, with a significant upward trend over time (P<0.001). Invited Commentaries had the highest proportion of detected AI-generated content (6.7%), followed by Original Investigations (2.2%) and Research Letters (1.4%). There was also significant variation across publication domains (P=0.04). Only 15 articles (0.2%) disclosed large language model use, of which 40.0% were classified as containing AI-generated text. While the findings suggest increasing detectable AI-generated content in medical literature, the limitations of current detection tools necessitate cautious interpretation.
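
The headline percentages can be re-derived from the counts given in the abstract. The sketch below does only that arithmetic. The helper name is hypothetical, and the count of disclosed-and-flagged articles is inferred from the quoted 40.0% rather than stated directly.

```python
def share(count, total):
    """Percentage of `total` represented by `count`."""
    return 100.0 * count / total


total_articles = 7251
flagged_pct = share(195, total_articles)    # classified as AI-generated
disclosed_pct = share(15, total_articles)   # disclosed LLM use

# 40.0% of the 15 disclosing articles (inferred count, not stated directly).
disclosed_and_flagged = round(0.40 * 15)

print(round(flagged_pct, 1), round(disclosed_pct, 1), disclosed_and_flagged)
```

Read this way, about 6 of the 15 articles that disclosed LLM use were also flagged by the detector, so most disclosed use went undetected in this sample.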