http://arxiv.org/abs/2507.01520v1 A bibliometric analysis on the current situation and hot trends of the impact of microplastics on soil based on CiteSpace 2025-07-02T09:25:09Z

This paper aims to comprehensively grasp the research status and development trends of soil microplastics (MPs). It collects studies from the Web of Science Core Collection covering the period from 2013 to 2024. Employing CiteSpace and VOSviewer, the paper conducts in-depth analyses of literature regarding the environmental impacts of microplastics. These analyses involve keyword co-occurrence, clustering, burst term identification, as well as co-occurrence analysis of authors and institutions. Microplastics can accumulate in soil, transfer through food chains, and ultimately affect human health, making the research on them essential for effective pollution control. Focusing on the international research on the impacts of microplastics on soil and ecosystems, the study reveals a steadily increasing trend in the number of publications each year, reaching a peak of 956 articles in 2024. A small number of highly productive authors contribute significantly to the overall research output. The keyword clustering analysis results in ten major clusters, including topics such as plastic pollution and microbial communities. The research on soil microplastics has evolved through three distinct stages: the preliminary exploration phase from 2013 to 2016, the expansion phase from 2017 to 2020, and the integration phase from 2021 to 2024. For future research, multi-level assessments of the impacts of microplastics on soil ecosystems and organisms should be emphasized, in order to fully uncover the associated hazards and develop practical solutions.

2025-07-02T09:25:09Z Yiran Zheng Yue Quan Su Yan Xinting Lv Yuguanmin Cao Minjie Fu Mingji Jin http://arxiv.org/abs/2507.01228v1 The hunt for research data: Development of an open-source workflow for tracking institutionally-affiliated research data publications 2025-07-01T22:59:41Z

The ability to find data is central to the FAIR principles underlying research data stewardship. As with the ability to reuse data, efforts to ensure and enhance findability have historically focused on discoverability of data by other researchers, but there is a growing recognition of the importance of stewarding data in a fashion that makes them FAIR for a wide range of potential reusers and stakeholders. Research institutions are one such stakeholder and have a range of motivations for discovering data, specifically those affiliated with a focal institution, from facilitating compliance with funder provisions to gathering data to inform research data services. However, many research datasets and repositories are not optimized for institutional discovery (e.g., not recording or standardizing affiliation metadata), which creates downstream obstacles to workflows designed for theoretically comprehensive discovery and to metadata-conscious data generators. Here I describe an open-source workflow for institutional tracking of research datasets at the University of Texas at Austin. This workflow comprises a multi-faceted approach that utilizes multiple open application programming interfaces (APIs) in order to address some of the common challenges to institutional discovery, such as variation in whether affiliation metadata are recorded or made public, and if so, how metadata are standardized, structured, and recorded. It is presently able to retrieve more than 4,000 affiliated datasets across nearly 70 distinct platforms, including objects without DOIs and objects without affiliation metadata. However, there remain major gaps that stem from suboptimal practices of both researchers and data repositories, many of which were identified in previous studies and which persist despite significant investment in efforts to standardize and elevate the quality of datasets and their metadata.

2025-07-01T22:59:41Z 66 pages, 12 figures, 10 tables Journal of eScience Librarianship 15 (2026) e1170 Bryan M. Gee 10.7191/jeslib.1170 http://arxiv.org/abs/2507.00961v1 Digital Collections Explorer: An Open-Source, Multimodal Viewer for Searching Digital Collections 2025-07-01T17:10:34Z

We present Digital Collections Explorer, a web-based, open-source exploratory search platform that leverages CLIP (Contrastive Language-Image Pre-training) for enhanced visual discovery of digital collections. Our Digital Collections Explorer can be installed locally and configured to run on a visual collection of interest on disk in just a few steps. Building upon recent advances in multimodal search techniques, our interface enables natural language queries and reverse image searches over digital collections with visual features. This paper describes the system's architecture, implementation, and application to various cultural heritage collections, demonstrating its potential for democratizing access to digital archives, especially those with impoverished metadata. We present case studies with maps, photographs, and PDFs extracted from web archives in order to demonstrate the flexibility of the Digital Collections Explorer, as well as its ease of use. We demonstrate that the Digital Collections Explorer scales to hundreds of thousands of images on a MacBook Pro with an M4 chip. Lastly, we host a public demo of Digital Collections Explorer.

2025-07-01T17:10:34Z 14 pages, 8 figures, 2 tables Comput. humanit. res. 1 (2025) e14 Ying-Hsiang Huang Benjamin Charles Germain Lee 10.1017/chr.2025.10017 http://arxiv.org/abs/2506.22729v2 Persistence Paradox in Dynamic Science 2025-07-01T16:14:58Z

Persistence is often regarded as a virtue in science. In this paper, however, we challenge this conventional view by highlighting its contextual nature, particularly how persistence can become a liability during periods of paradigm shift. We focus on the deep learning revolution catalyzed by AlexNet in 2012. Analyzing the 20-year career trajectories of over 5,000 scientists who were active in top machine learning venues during the preceding decade, we examine how their research focus and output evolved. We first uncover a dynamic period in which leading venues increasingly prioritized cutting-edge deep learning developments that displaced relatively traditional statistical learning methods. Scientists responded to these changes in markedly different ways. Those who were previously successful or affiliated with old teams adapted more slowly, experiencing what we term a rigidity penalty - a reluctance to embrace new directions leading to a decline in scientific impact, as measured by citation percentile rank. In contrast, scientists who pursued strategic adaptation - selectively pivoting toward emerging trends while preserving weak connections to prior expertise - reaped the greatest benefits. Taken together, our macro- and micro-level findings show that scientific breakthroughs act as mechanisms that reconfigure power structures within a field.

2025-06-28T02:21:19Z Honglin Bao Kai Li http://arxiv.org/abs/2507.00783v1 Generative AI and the future of scientometrics: current topics and future questions 2025-07-01T14:22:16Z

The aim of this paper is to review the use of GenAI in scientometrics, and to begin a debate on the broader implications for the field. First, we provide an introduction to GenAI's generative and probabilistic nature as rooted in distributional linguistics, and we relate this to the debate on the extent to which GenAI might be able to mimic human 'reasoning'. Second, we leverage this distinction for a critical engagement with recent experiments using GenAI in scientometrics, including topic labelling, the analysis of citation contexts, predictive applications, scholars' profiling, and research assessment. GenAI shows promise in tasks where language generation dominates, such as labelling, but faces limitations in tasks that require stable semantics, pragmatic reasoning, or structured domain knowledge. However, these results might become quickly outdated. Our recommendation is, therefore, to always strive to systematically compare the performance of different GenAI models for specific tasks. Third, we inquire whether, by generating large amounts of scientific language, GenAI might have a fundamental impact on our field by affecting textual characteristics used to measure science, such as authors, words, and references. We argue that careful empirical work and theoretical reflection will be essential to remain capable of interpreting the evolving patterns of knowledge production.

2025-07-01T14:22:16Z Benedetto Lepori Jens Peter Andersen Karsten Donnay http://arxiv.org/abs/2501.16197v2 HERITRACE: A User-Friendly Semantic Data Editor with Change Tracking and Provenance Management for Cultural Heritage Institutions 2025-07-01T08:51:57Z

HERITRACE is a data editor designed for galleries, libraries, archives and museums, aimed at simplifying data curation while enabling non-technical domain experts to manage data intuitively without losing its semantic integrity. While the semantic nature of RDF can pose a barrier to data curation due to its complexity, HERITRACE conceals this intricacy while preserving the advantages of semantic representation. The system natively supports provenance management and change tracking, ensuring transparency and accountability throughout the curation process. Although HERITRACE functions effectively out of the box, it offers a straightforward customization interface for technical staff, enabling adaptation to the specific data model required by a given collection. Current applications include the ParaText project, and its adoption is already planned for OpenCitations. Future developments will focus on integrating the RDF Mapping Language (RML) to enhance compatibility with non-RDF data formats, further expanding its applicability in digital heritage management.

2025-01-27T16:48:39Z 25 pages, 5 figures, 2 tables, 1 listing, submitted to Umanistica Digitale Arcangelo Massari Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy Silvio Peroni Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy http://arxiv.org/abs/2408.05975v2 A Metascience Study of the Low-Code Scientific Field 2025-06-30T20:21:16Z

In recent years, model-related publications have been exploring the application of modeling techniques across various domains. Initially focused on UML and the Model-Driven Architecture approach, the literature has been evolving towards the usage of more general concepts such as Model-Driven Development or Model-Driven Engineering. More recently, however, the term "low-code" has taken the modeling field by storm, largely due to its association with several highly popular development platforms. The research community is still discussing the differences and commonalities between this emerging term and previous modeling-related concepts, as well as the broader implications of low-code on the modeling field. In this paper, we present a metascience study of Low-Code. Our study follows a two-fold approach: (1) to analyze the composition and growth (e.g., size, diversity, venues, and topics) of the emerging Low-Code community; and (2) to explore how these aspects differ from those of the "classical" model-driven community. Ultimately, we hope to trigger a discussion on the current state and potential future trajectory of the low-code community, as well as the opportunities for collaboration and synergies between the low-code and modeling communities.

2024-08-12T08:03:01Z Journal of Object Technology. Vol. 24, No. 2, 2025. Licensed under Attribution 4.0 International (CC BY 4.0) Mauro Dalle Lucca Tosi Javier Luis Cánovas Izquierdo Jordi Cabot 10.5381/jot.2025.24.2.a10 http://arxiv.org/abs/2507.00103v1 Gender and Discipline Shape Length, Content and Tone of Grant Peer Review Reports 2025-06-30T15:08:25Z

Peer review by experts is central to the evaluation of grant proposals, but little is known about how gender and disciplinary differences shape the content and tone of grant peer review reports. We analyzed 39,280 review reports submitted to the Swiss National Science Foundation between 2016 and 2023, covering 11,385 proposals for project funding across 21 disciplines from the Social Sciences and Humanities (SSH), Life Sciences (LS), and Mathematics, Informatics, Natural Sciences, and Technology (MINT). Using supervised machine learning, we classified over 1.3 million sentences by evaluation criteria and sentiment. Reviews in SSH were significantly longer and more critical, with less focus on the applicant's track record, while those in MINT were more concise and positive, with a higher focus on the track record, as compared to those in LS. Compared to male reviewers, female reviewers write longer reviews that more closely align with the evaluation criteria and express more positive sentiments. Female applicants tend to receive reviews with slightly more positive sentiment than male applicants. Gender and disciplinary culture influence how grant proposals are reviewed - shaping the tone, length, and focus of peer review reports. These differences have important implications for fairness and consistency in research funding.

2025-06-30T15:08:25Z Stefan Müller Gabriel Okasa Michaela Strinzel Anne Jorstad Katrin Milzow Matthias Egger http://arxiv.org/abs/2506.23419v1 BenchMake: Turn any scientific data set into a reproducible benchmark 2025-06-29T22:56:48Z

Benchmark data sets are a cornerstone of machine learning development and applications, ensuring new methods are robust, reliable and competitive. The relative rarity of benchmark sets in computational science, due to the uniqueness of the problems and the pace of change in the associated domains, makes evaluating new innovations difficult for computational scientists. In this paper, a new tool is developed and tested to potentially turn any of the increasing numbers of scientific data sets made openly available into a benchmark accessible to the community. BenchMake uses non-negative matrix factorisation to deterministically identify and isolate challenging edge cases on the convex hull (the smallest convex set that contains all existing data instances) and partitions a required fraction of matched data instances into a testing set that maximises divergence and statistical significance, across tabular, graph, image, signal and textual modalities. BenchMake splits are compared to established splits and random splits using ten publicly available benchmark sets from different areas of science, with different sizes, shapes, and distributions.

2025-06-29T22:56:48Z 10 pages, 15 pages in Appendix, 15 figures, 5 tables, 57 references Amanda S Barnard http://arxiv.org/abs/2506.23366v1 Density, asymmetry and citation dynamics in scientific literature 2025-06-29T18:55:04Z

Scientific behavior is often characterized by a tension between building upon established knowledge and introducing novel ideas. Here, we investigate whether this tension is reflected in the relationship between the similarity of a scientific paper to previous research and its eventual citation rate. To operationalize similarity to previous research, we introduce two complementary metrics to characterize the local geometry of a publication's semantic neighborhood: (1) density ($ρ$), defined as the ratio between a fixed number of previously-published papers and the minimum distance enclosing those papers in a semantic embedding space, and (2) asymmetry ($α$), defined as the average directional difference between a paper and its nearest neighbors. We tested the predictive relationship between these two metrics and a paper's subsequent citation rate using a Bayesian hierarchical regression approach, surveying $\sim 53,000$ publications across nine academic disciplines and five different document embeddings. While the individual effects of $ρ$ on citation count are small and variable, incorporating density-based predictors consistently improves out-of-sample prediction when added to baseline models. These results suggest that the density of a paper's surrounding scientific literature may carry modest but informative signals about its eventual impact. Meanwhile, we find no evidence that publication asymmetry improves model predictions of citation rates. Our work provides a scalable framework for linking document embeddings to scientometric outcomes and highlights new questions regarding the role that semantic similarity plays in shaping the dynamics of scientific reward.
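
The two neighborhood metrics defined in this abstract can be illustrated with a minimal sketch. Several details are assumptions, not taken from the paper: Euclidean distance as the metric, the function name `knn_density_and_asymmetry`, and in particular the reading of "average directional difference" as the norm of the mean offset vector from a paper to its k nearest earlier neighbors.

```python
import math


def knn_density_and_asymmetry(target, prior, k):
    """Return (density, asymmetry) of `target` relative to earlier papers.

    target: embedding of the paper of interest (sequence of floats).
    prior:  embeddings of previously published papers.
    k:      fixed number of neighbors (hypothetical parameter name).
    """
    if not 0 < k <= len(prior):
        raise ValueError("k must be between 1 and len(prior)")

    # k nearest previously published papers (Euclidean distance is an assumption).
    neighbors = sorted(prior, key=lambda p: math.dist(target, p))[:k]

    # Density: k divided by the smallest radius that encloses those k papers.
    radius = math.dist(target, neighbors[-1])
    density = k / radius if radius > 0 else math.inf

    # Asymmetry (assumed form): length of the mean offset vector to the neighbors.
    # It is zero when neighbors surround the paper evenly and grows when they
    # all lie on one side of it.
    mean_offset = [
        sum(p[i] - target[i] for p in neighbors) / k for i in range(len(target))
    ]
    asymmetry = math.hypot(*mean_offset)
    return density, asymmetry
```

With this reading, a paper whose neighbors sit symmetrically around it has asymmetry close to zero regardless of how dense the neighborhood is, which is why the two quantities can be treated as complementary.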

2025-06-29T18:55:04Z Nathaniel Imel Zachary Hafen http://arxiv.org/abs/2507.16820v1 Disaster Informatics after the COVID-19 Pandemic: Bibliometric and Topic Analysis based on Large-scale Academic Literature 2025-06-28T20:30:36Z

This study presents a comprehensive bibliometric and topic analysis of the disaster informatics literature published between January 2020 and September 2022. Leveraging a large-scale corpus and advanced techniques such as pre-trained language models and generative AI, we identify the most active countries, institutions, authors, collaboration networks, emergent topics, patterns among the most significant topics, and shifts in research priorities spurred by the COVID-19 pandemic. Our findings highlight that (1) countries that were most impacted by the COVID-19 pandemic were also among the most active, with each country having specific research interests, (2) countries and institutions that are within the same region or share a common language tend to collaborate, (3) top active authors tend to form close partnerships with one or two key partners, (4) authors typically specialized in one or two specific topics, while institutions had more diverse interests across several topics, and (5) the COVID-19 pandemic has influenced research priorities in disaster informatics, placing greater emphasis on public health. We further demonstrate that the field is converging on multidimensional resilience strategies and cross-sectoral data-sharing collaborations or projects, reflecting a heightened awareness of global vulnerability and interdependency. The data collection and quality assurance strategies, data analytic practices, LLM-based topic extraction and summarization approaches, and result visualization tools can be applied to comparable datasets or used to solve similar analytic problems. By mapping out the trends in disaster informatics, our analysis offers strategic insights for policymakers, practitioners, and scholars aiming to enhance disaster informatics capacities in an increasingly uncertain and complex risk landscape.

2025-06-28T20:30:36Z 36 pages, 14 figures, 5 tables Ngan Tran Haihua Chen Ana Cleveland Yuhan Zhou http://arxiv.org/abs/2506.22946v1 Modular versus Hierarchical: A Structural Signature of Topic Popularity in Mathematical Research 2025-06-28T16:39:57Z

Mathematical researchers, especially those in early-career positions, face critical decisions about topic specialization with limited information about the collaborative environments of different research areas. The aim of this paper is to study how the popularity of a research topic is associated with the structure of that topic's collaboration network, as observed by a suite of measures capturing organizational structure at several scales. We apply these measures to 1,938 algorithmically discovered topics across 121,391 papers sourced from arXiv metadata during the period 2020--2025. Our analysis, which controls for the confounding effects of network size, reveals a structural dichotomy--we find that popular topics organize into modular "schools of thought," while niche topics maintain hierarchical core-periphery structures centered around established experts. This divide is not an artifact of scale, but represents a size-independent structural pattern correlated with popularity. We also document a "constraint reversal": after controlling for size, researchers in popular fields face greater structural constraints on collaboration opportunities, contrary to conventional expectations. Our findings suggest that topic selection is an implicit choice between two fundamentally different collaborative environments, each with distinct implications for a researcher's career. To make these structural patterns transparent to the research community, we developed the Math Research Compass (https://mathresearchcompass.com), an interactive platform providing data on topic popularity and collaboration patterns across mathematical topics.

2025-06-28T16:39:57Z Brian Hepler http://arxiv.org/abs/2508.00838v1 The Attribution Crisis in LLM Search Results 2025-06-27T15:44:16Z

Web-enabled LLMs frequently answer queries without crediting the web pages they consume, creating an "attribution gap" - the difference between relevant URLs read and those actually cited. Drawing on approximately 14,000 real-world LMArena conversation logs with search-enabled LLM systems, we document three exploitation patterns: 1) No Search: 34% of Google Gemini and 24% of OpenAI GPT-4o responses are generated without explicitly fetching any online content; 2) No citation: Gemini provides no clickable citation source in 92% of answers; 3) High-volume, low-credit: Perplexity's Sonar visits approximately 10 relevant pages per query but cites only three to four. A negative binomial hurdle model shows that the average query answered by Gemini or Sonar leaves about 3 relevant websites uncited, whereas GPT-4o's tiny uncited gap is best explained by its selective log disclosures rather than by better attribution. Citation efficiency - extra citations provided per additional relevant web page visited - varies widely across models, from 0.19 to 0.45 on identical queries, underscoring that retrieval design, not technical limits, shapes ecosystem impact. We recommend a transparent LLM search architecture based on standardized telemetry and full disclosure of search traces and citation logs.
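
The two quantities named in this abstract can be made concrete with a small sketch. It is illustrative only: the function names are hypothetical, and the ordinary least-squares slope used for citation efficiency is a simple stand-in for the negative binomial hurdle model the authors report, not a reimplementation of it.

```python
def attribution_gap(relevant_visited, cited):
    """Count relevant URLs that were read for a query but never cited."""
    return len(set(relevant_visited) - set(cited))


def citation_efficiency(visited_counts, cited_counts):
    """Extra citations per additional relevant page visited.

    Estimated here as the least-squares slope of citations on pages visited
    across queries (an assumed simplification of the paper's model).
    """
    if len(visited_counts) != len(cited_counts) or not visited_counts:
        raise ValueError("need two non-empty lists of equal length")
    n = len(visited_counts)
    mean_x = sum(visited_counts) / n
    mean_y = sum(cited_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in visited_counts)
    if sxx == 0:
        raise ValueError("pages visited must vary across queries")
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(visited_counts, cited_counts)
    )
    return sxy / sxx
```

For example, a query where three relevant pages were read and one was cited has an attribution gap of 2, and a system that adds one citation for every two extra pages it reads has an efficiency of 0.5.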

2025-06-27T15:44:16Z Ilan Strauss Jangho Yang Tim O'Reilly Sruly Rosenblat Isobel Moure 10.35650/AIDP.4114.d.2025 http://arxiv.org/abs/2506.21331v1 Automatic Reviewers Assignment to a Research Paper Based on Allied References and Publications Weight 2025-06-26T14:44:06Z

Every day, a vast stream of research documents is submitted to conferences, anthologies, journals, newsletters, annual reports, daily papers, and various periodicals. Many such publications use independent external specialists to review submissions. This process is called peer review, and the reviewers are called referees. However, it is not always possible to pick the best referee for reviewing. Moreover, new research fields are emerging in every sector, and the number of research papers is increasing dramatically. To review all these papers, every journal assigns a small team of referees who may not be experts in all areas. For example, a research paper in communication technology should be reviewed by an expert from the same field. Thus, efficiently selecting the best reviewer or referee for a research paper is a big challenge. In this research, we propose and implement a program that uses a new strategy to automatically select the best reviewers for a research paper. Every research paper contains references at the end, usually from the same area. First, we collect the references and count authors who have at least one paper in the references. Then, we automatically browse the web to extract research topic keywords. Next, we search for top researchers in the specific topic and count their h-index, i10-index, and citations for the first n authors. Afterward, we rank the top n authors based on a score and automatically browse their homepages to retrieve email addresses. We also check their co-authors and colleagues online and discard them from the list. The remaining top n authors, generally professors, are likely the best referees for reviewing the research paper.
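
The ranking and conflict-filtering steps described in this abstract can be sketched as follows. The abstract does not give the scoring formula, so the linear score and its weights are purely hypothetical, as are the `Candidate` fields and function names; web browsing and email retrieval are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    h_index: int
    i10_index: int
    citations: int
    reference_count: int = 0  # papers by this author in the manuscript's references


def score(candidate, weights=(1.0, 0.5, 0.001, 2.0)):
    """Hypothetical linear score; the weights are illustrative assumptions."""
    w_h, w_i10, w_cit, w_ref = weights
    return (
        w_h * candidate.h_index
        + w_i10 * candidate.i10_index
        + w_cit * candidate.citations
        + w_ref * candidate.reference_count
    )


def rank_reviewers(candidates, conflicted_names, n):
    """Drop co-authors and colleagues, then return the n highest-scoring candidates."""
    eligible = [c for c in candidates if c.name not in conflicted_names]
    return sorted(eligible, key=score, reverse=True)[:n]
```

Filtering conflicts before ranking, instead of after truncating to n, keeps the returned list at full length whenever enough eligible candidates exist.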

2025-06-26T14:44:06Z IEEE Conference Proceedings (5 Pages) 2018 4th International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 2018, pp. 1-5 Tamim Al Mahmud B M Mainul Hossain Dilshad Ara 10.1109/CCAA.2018.8777730 http://arxiv.org/abs/2506.21232v1 The State of Papers, Retractions, and Preprints: Evidence from the CrossRef Database (2004-2024) 2025-06-26T13:21:25Z

A 20-year analysis of CrossRef metadata demonstrates that global scholarly output -- encompassing publications, retractions, and preprints -- exhibits strikingly inertial growth, well-described by exponential, quadratic, and logistic models with nearly indistinguishable goodness-of-fit. Retraction dynamics, in particular, remain stable and minimally affected by the COVID-19 shock, which contributed less than 1% to total notices. Since 2004, publications doubled every 9.8 years, retractions every 11.4 years, and preprints at the fastest rate, every 5.6 years. The findings underscore a system primed for ongoing stress at unchanged structural bottlenecks. Although model forecasts diverge beyond 2024, the evidence suggests that the future trajectory of scholarly communication will be determined by persistent systemic inertia rather than episodic disruptions -- unless intentionally redirected by policy or AI-driven reform.
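
The doubling times quoted in this abstract can be converted into annual growth rates with a short calculation. The sketch assumes pure exponential growth, which is only one of the three model families the abstract says fit the data about equally well; the function names are illustrative.

```python
def annual_growth_rate(doubling_time_years):
    """Annual growth rate implied by a doubling time, assuming exponential growth."""
    if doubling_time_years <= 0:
        raise ValueError("doubling time must be positive")
    return 2 ** (1 / doubling_time_years) - 1


def growth_factor(doubling_time_years, years):
    """Multiplicative increase after `years` at a constant doubling time."""
    if doubling_time_years <= 0:
        raise ValueError("doubling time must be positive")
    return 2 ** (years / doubling_time_years)


if __name__ == "__main__":
    # Doubling times as reported in the abstract.
    for label, t in [("publications", 9.8), ("retractions", 11.4), ("preprints", 5.6)]:
        print(f"{label}: {annual_growth_rate(t):.1%} per year")
```

Under this assumption the reported doubling times correspond to roughly 7.3% annual growth for publications, 6.3% for retractions, and 13.2% for preprints.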

2025-06-26T13:21:25Z Khalid M. Saqr