http://arxiv.org/abs/2508.18217v2 Lost Data in Electron Microscopy 2025-09-17T14:18:59Z

The goal of this study is to estimate the amount of lost data in electron microscopy and to analyze the extent to which experimentally acquired images are utilized in peer-reviewed scientific publications. Analysis of the number of images taken on electron microscopes at a core user facility and the number of images subsequently included in peer-reviewed scientific journals revealed low efficiency of data utilization. Up to around 90% of electron microscopy data generated during routine instrument operation remain unused. Of the more than 150 000 electron microscopy images evaluated in this study, only approximately 3 500 (just over 2%) were made available in publications. For the analyzed dataset, the amount of lost data in electron microscopy can be estimated as >90% (in terms of data being recorded but not being published in peer-reviewed literature). On the one hand, these results highlight a shortcoming in the optimal use of microscopy images; on the other hand, they indicate the existence of a large pool of electron microscopy data that can facilitate research in data science and the development of AI-based projects. The considerations important to unlock the potential of lost data are discussed in the present article.
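The utilization figures quoted in this abstract can be checked with a short calculation. The sketch below is not from the paper: the function name `utilization` is hypothetical, and the inputs are the rounded counts from the abstract ("more than 150 000" images, "approximately 3 500" published), so the result is only an approximation.

```python
def utilization(total_images: int, published_images: int) -> tuple[float, float]:
    """Return (published share, unpublished share) as percentages."""
    if total_images <= 0:
        raise ValueError("total_images must be positive")
    if not 0 <= published_images <= total_images:
        raise ValueError("published_images must lie between 0 and total_images")
    published = 100.0 * published_images / total_images
    return published, 100.0 - published


# Rounded counts taken from the abstract; the exact values are not given there.
published, lost = utilization(150_000, 3_500)
print(f"published: {published:.1f}%  unpublished: {lost:.1f}%")
```

With these rounded inputs the published share is a little over 2%, which agrees with the ">90%" lower bound on lost data stated in the abstract.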

2025-08-10T13:50:13Z 20 pages, 4 figures, 2 tables Nina M. Ivanova Alexey S. Kashin Valentine P. Ananikov 10.3390/chemistry7050160 http://arxiv.org/abs/2509.13524v1 The NIAID Discovery Portal: A Unified Search Engine for Infectious and Immune-Mediated Disease Datasets 2025-09-16T20:38:35Z

The NIAID Data Ecosystem Discovery Portal (https://data.niaid.nih.gov) provides a unified search interface for over 4 million datasets relevant to infectious and immune-mediated disease (IID) research. Integrating metadata from domain-specific and generalist repositories, the Portal enables researchers to identify and access datasets using user-friendly filters or advanced queries, without requiring technical expertise. The Portal supports discovery of a wide range of resources, including epidemiological, clinical, and multi-omic datasets, and is designed to accommodate exploratory browsing and precise searches. The Portal provides filters, prebuilt queries, and dataset collections to simplify the discovery process for users. The Portal additionally provides documentation and an API for programmatic access to harmonized metadata. By easing access barriers to important biomedical datasets, the NIAID Data Ecosystem Discovery Portal serves as an entry point for researchers working to understand, diagnose, or treat IID. Valuable datasets are often overlooked because they are difficult to locate. The NIAID Data Ecosystem Discovery Portal fills this gap by providing a centralized, searchable interface that empowers users with varying levels of technical expertise to find and reuse data. By standardizing key metadata fields and harmonizing heterogeneous formats, the Portal improves data findability, accessibility, and reusability. This resource supports hypothesis generation, comparative analysis, and secondary use of public data by the IID research community, including those funded by NIAID. The Portal supports data sharing by standardizing metadata and linking to source repositories, and maximizes the impact of public investment in research data by supporting scientific advancement via secondary use.

2025-09-16T20:38:35Z 20 pages, 3 figures, 1 table, submitted to mSystems Ginger Tsueng The Scripps Research Institute, La Jolla, CA, USA Emily Bullen The Scripps Research Institute, La Jolla, CA, USA Candice Czech The Scripps Research Institute, La Jolla, CA, USA Dylan Welzel The Scripps Research Institute, La Jolla, CA, USA Leandro Collares The Scripps Research Institute, La Jolla, CA, USA Jason Lin The Scripps Research Institute, La Jolla, CA, USA Everaldo Rodolpho The Scripps Research Institute, La Jolla, CA, USA Zubair Qazi The Scripps Research Institute, La Jolla, CA, USA Nichollette Acosta The Scripps Research Institute, La Jolla, CA, USA Lisa M. Mayer National Institute of Allergy and Infectious Diseases, Rockville, MD, USA Sudha Venkatachari National Cancer Institute, Rockville, MD, USA Zorana Mitrović Vučičević Velsera, Charlestown, MA, USA Poromendro N. Burman Velsera, Charlestown, MA, USA Deepti Jain Velsera, Charlestown, MA, USA Jack DiGiovanna Velsera, Charlestown, MA, USA Maria Giovanni National Institute of Allergy and Infectious Diseases, Rockville, MD, USA Asiyah Lin National Institute of Allergy and Infectious Diseases, Rockville, MD, USA Wilbert Van Panhuis National Institute of Allergy and Infectious Diseases, Rockville, MD, USA Laura D. Hughes The Scripps Research Institute, La Jolla, CA, USA Andrew I. Su The Scripps Research Institute, La Jolla, CA, USA Chunlei Wu The Scripps Research Institute, La Jolla, CA, USA http://arxiv.org/abs/2509.13236v1 Layout-Aware OCR for Black Digital Archives with Unsupervised Evaluation 2025-09-16T16:43:34Z

Despite their cultural and historical significance, Black digital archives continue to be a structurally underrepresented area in AI research and infrastructure. This is especially evident in efforts to digitize historical Black newspapers, where inconsistent typography, visual degradation, and limited annotated layout data hinder accurate transcription, despite the availability of various systems that claim to handle optical character recognition (OCR) well. In this short paper, we present a layout-aware OCR pipeline tailored for Black newspaper archives and introduce an unsupervised evaluation framework suited to low-resource archival contexts. Our approach integrates synthetic layout generation, model pretraining on augmented data, and a fusion of state-of-the-art You Only Look Once (YOLO) detectors. We used three annotation-free evaluation metrics, the Semantic Coherence Score (SCS), Region Entropy (RE), and Textual Redundancy Score (TRS), which quantify linguistic fluency, informational diversity, and redundancy across OCR regions. Our evaluation on a 400-page dataset from ten Black newspaper titles demonstrates that layout-aware OCR improves structural diversity and reduces redundancy compared to full-page baselines, with modest trade-offs in coherence. Our results highlight the importance of respecting cultural layout logic in AI-driven document understanding and lay the foundation for future community-driven and ethically grounded archival AI systems.

2025-09-16T16:43:34Z IEEE-ISTAS conference Fitsum Sileshi Beyene Christopher L. Dancy http://arxiv.org/abs/2501.16701v3 Understanding the importance of SHAPE to the UK research ecosystem 2025-09-13T19:35:59Z

The UK has a long-established reputation for excellence in research across a broad range of fields, but in recent years, there has been greater emphasis on STEM investment and greater recognition of the UK's success in STEM. This paper examines the relative strengths of SHAPE disciplines and demonstrates that the UK's SHAPE research portfolio outperforms the UK's STEM research, for each international benchmark considered in this work. It is argued that SHAPE research is becoming increasingly important as a partner to STEM as the widespread use of technology creates societal challenges. It is also argued that the strength of UK SHAPE is the basis of a strategic advantage for UK research.

2025-01-28T04:33:28Z 13 pages, 9 figures, minor metadata update Hélène Draux Briony Fane Daniel W. Hook Juergen Wastl Philip Lewis Molly Morgan Jones Pablo Roblero James R. Wilsdon http://arxiv.org/abs/2509.12245v1 Identifying Information Technology Research Trends through Text Mining of NSF Awards 2025-09-11T17:59:38Z

Information Technology (IT) is recognized as an independent and unique research field. However, there has been ambiguity and difficulty in identifying and differentiating IT research from other close variations. Given this context, this paper aimed to explore the roots of the Information Technology (IT) research domain by conducting a large-scale text mining analysis of 50,780 abstracts from awarded NSF CISE grants from 1985 to 2024. We categorized the awards based on their program content, labeling human-centric programs as IT research programs and infrastructure-centric programs as other research programs based on the IT definitions in the literature. This novel approach helped us identify the core concepts of IT research and compare the similarities and differences between IT research and other research areas. The results showed that IT differentiates itself from other close variations by focusing more on the needs of users, organizations, and societies.

2025-09-11T17:59:38Z 8 pages, under review Said Varlioglu Hazem Said Murat Ozer Nelly Elsayed http://arxiv.org/abs/2409.00081v2 Examining Different Research Communities: Authorship Network 2025-09-10T22:53:38Z

Google Scholar is one of the leading search engines for accessing scholarly literature across multiple disciplines. Its advanced search option makes it possible to extract articles by phrase, publisher name, author name, time period, and so on. In this work, we collected Google Scholar data (2000-2021) for two research domains in computer science: Data Mining and Software Engineering. Scholar database resources are powerful for network analysis, data mining, and identifying links between authors via authorship networks. We examined the co-authorship network for each domain and studied its structure. Extensive experiments were performed to analyze publication trends and to identify influential authors and affiliated organizations for each domain. The network analysis shows that the features of the two networks are distinct from one another and that each exhibits small communities among the influential authors of its domain.

2024-08-24T19:04:02Z Shrabani Ghosh 10.1007/978-3-031-82435-7_6 http://arxiv.org/abs/2506.00074v2 Whose Name Comes Up? Auditing LLM-Based Scholar Recommendations 2025-09-10T17:27:44Z

This paper evaluates the performance of six open-weight LLMs (llama3-8b, llama3.1-8b, gemma2-9b, mixtral-8x7b, llama3-70b, llama3.1-70b) in recommending experts in physics across five tasks: top-k experts by field, influential scientists by discipline, epoch, seniority, and scholar counterparts. The evaluation examines consistency, factuality, and biases related to gender, ethnicity, academic popularity, and scholar similarity. Using ground-truth data from the American Physical Society and OpenAlex, we establish scholarly benchmarks by comparing model outputs to real-world academic records. Our analysis reveals inconsistencies and biases across all models. mixtral-8x7b produces the most stable outputs, while llama3.1-70b shows the highest variability. Many models exhibit duplication, and some, particularly gemma2-9b and llama3.1-8b, struggle with formatting errors. LLMs generally recommend real scientists, but accuracy drops in field-, epoch-, and seniority-specific queries, consistently favoring senior scholars. Representation biases persist, replicating gender imbalances (reflecting male predominance), under-representing Asian scientists, and over-representing White scholars. Despite some diversity in institutional and collaboration networks, models favor highly cited and productive scholars, reinforcing the rich-get-richer effect while offering limited geographical representation. These findings highlight the need to improve LLMs for more reliable and equitable scholarly recommendations.

2025-05-29T20:11:11Z 40 pages: 10 main (incl. 9 figures), 3 references, and 27 appendix. Paper under-review Daniele Barolo Chiara Valentin Fariba Karimi Luis Galárraga Gonzalo G. Méndez Lisette Espín-Noboa http://arxiv.org/abs/2509.08710v1 A Use Case Lens on Digital Cultural Heritage 2025-09-10T15:59:31Z

This article proposes a novel methodological approach for developing use cases for cultural heritage (CH) e-infrastructures, with the use cases documented in Jupyter Notebooks (JNs) to enable transparency and reproducibility. We also address the problem that use cases are not consistently documented to cover all the key aspects that define a useful use case, as derived from a review of the use case literature outside the CH field. Purpose. Our primary objective is to explore the practices around creating and analysing use cases related to digital cultural heritage. Our review of the literature showed substantial variation in the depth and coverage of use cases and revealed the need for a more robust and consistent approach to creating use cases in a digital heritage context. As a first step, we developed a framework for creating use cases to support ongoing efforts to expand the use of e-infrastructures in the digital heritage domain. Design/methodology/approach. Our research design combines desk research of existing literature with an analysis of example use cases documented in projects. We examine the challenges and inconsistencies in the current practice of use case production in digital heritage. Finally, we synthesize a systematic process for generating use cases, illustrated by five example use cases in this context. Our work directly affects infrastructures and communities such as the International GLAM Labs Community, AI for Libraries, Archives, and Museums (AI4LAM), and the Time Machine Organisation. This work advances the use of data research infrastructures within communities of researchers, scholars, students, GLAM (Galleries, Libraries, Archives, and Museums) institutions, and Cultural Heritage and Cultural and Creative Industries (CCIs).

2025-09-10T15:59:31Z Gustavo Candela Milena Dobreva Henk Alkemade Olga Holownia Mahendra Mahey Sarah Ames Karen Renaud Ines Vodopivec Benjamin Charles Germain Lee Thomas Padilla Steven Claeyssens Isto Huvila Beth Knazook http://arxiv.org/abs/2509.08299v1 Causal evidence of racial and institutional biases in accessing paywalled articles and scientific data 2025-09-10T05:39:08Z

Scientific progress fundamentally depends on researchers' ability to access and build upon the work of others. Yet, a majority of published work remains behind expensive paywalls, limiting access to universities that can afford subscriptions. Furthermore, even when articles are accessible, the underlying datasets could be restricted, available only through a "reasonable request" to the authors. One way researchers could overcome these barriers is by relying on informal channels, such as emailing authors directly, to obtain paywalled articles or restricted datasets. However, whether these informal channels are hindered by racial and/or institutional biases remains unknown. Here, we combine qualitative semi-structured interviews, large-scale observational analysis, and two randomized audit experiments to examine racial and institutional disparities in access to scientific knowledge. Our analysis of 250 million articles reveals that researchers in the Global South cite paywalled papers and upon-request datasets at significantly lower rates than their Global North counterparts, and that these access gaps are associated with reduced knowledge breadth and scholarly impact. To interrogate the mechanisms underlying this phenomenon, we conduct two randomized email audit studies in which fictional PhD students differing in racial background and institutional affiliation request access to paywalled articles (N = 18,000) and datasets (N = 11,840). We find that racial identity predicts the response rate to paywalled article requests more strongly than institutional affiliation does, whereas institutional affiliation plays a larger role in shaping access to datasets. These findings reveal how informal gatekeeping can perpetuate structural inequities in science, highlighting the need for stronger data-sharing mandates and more equitable open access policies.

2025-09-10T05:39:08Z 44 pages, 9 figures Hazem Ibrahim Fengyuan Liu Khalid Mengal Aaron R. Kaufman Yasir Zaki Talal Rahwan http://arxiv.org/abs/2508.06401v3 A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges 2025-09-09T16:35:32Z

This systematic review of the research literature on retrieval-augmented generation (RAG) provides a focused analysis of the most highly cited studies published between 2020 and May 2025. A total of 128 articles met our inclusion criteria. The records were retrieved from ACM Digital Library, IEEE Xplore, Scopus, ScienceDirect, and the Digital Bibliography and Library Project (DBLP). RAG couples a neural retriever with a generative language model, grounding output in up-to-date, non-parametric memory while retaining the semantic generalisation stored in model weights. Guided by the PRISMA 2020 framework, we (i) specify explicit inclusion and exclusion criteria based on citation count and research questions, (ii) catalogue datasets, architectures, and evaluation practices, and (iii) synthesise empirical evidence on the effectiveness and limitations of RAG. To mitigate citation-lag bias, we applied a lower citation-count threshold to papers published in 2025 so that emerging breakthroughs with naturally fewer citations were still captured. This review clarifies the current research landscape, highlights methodological gaps, and charts priority directions for future research.

2025-08-08T15:37:14Z 58 pages Andrew Brown Muhammad Roman Barry Devereux http://arxiv.org/abs/2509.07142v1 Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models 2025-09-08T18:46:08Z

This study presents a framework for automated evaluation of dynamically evolving topic models using Large Language Models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework that employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift, which are often missed by traditional metrics. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at https://github.com/zhiyintan/topic-model-LLMjudgment.

2025-09-08T18:46:08Z Accepted for publication in International Journal on Digital Libraries (IJDL) International Journal on Digital Libraries, vol. 26, no. 4, pp. 23, December 2025 Zhiyin Tan Jennifer D'Souza 10.1007/s00799-025-00429-5 http://arxiv.org/abs/2509.06412v1 Compare: A Framework for Scientific Comparisons 2025-09-08T08:05:26Z

Navigating the vast and rapidly growing body of academic publications to identify institutional synergies, benchmark research contributions, and pinpoint key contributions has become an increasingly daunting task, especially given the current exponential growth in new publications. Existing tools provide useful overviews or single-document insights, but none supports structured, qualitative comparisons across institutions or publications. To address this, we demonstrate Compare, a novel framework that enables sophisticated long-context comparisons of scientific contributions. Compare empowers users to explore and analyze research overlaps and differences at both the institutional and publication levels, driven by user-defined questions and automatic retrieval over online resources. To this end, we leverage Retrieval-Augmented Generation over evolving data sources to foster long-context knowledge synthesis. Unlike traditional scientometric tools, Compare goes beyond quantitative indicators by providing qualitative, citation-supported comparisons.

2025-09-08T08:05:26Z Accepted at CIKM 2025 Moritz Staudinger Wojciech Kusa Matteo Cancellieri David Pride Petr Knoth Allan Hanbury http://arxiv.org/abs/2509.12230v1 Storage places in diplomatic texts (7th-13th centuries). Lexical, semantic, and digital investigation 2025-09-08T06:55:31Z

This study examines the evolution of references to grain storage structures in medieval European charters, based on a quantitative and semantic analysis of the digitized CEMA (Cartae Europae Medii Aevi) corpus comprising more than 225,000 documents. The author applies text mining and distributional analysis methods to a lexicon of some forty terms designating storage locations (grangia, horreum, granarium, granica, etc.), cross-referencing these data with references to grain and analyzing their semantic contexts over the long term. The analysis reveals a paradigm shift between the early Middle Ages (decentralized, loosely regulated storage) and the 12th-13th centuries (centralization of storage by the ruling classes). Granaries became instruments of spatial polarization and social control, contributing to the accentuation of social domination in medieval Europe. This evolution was accompanied by a new conceptualization of storage, both material and spiritual.

2025-09-08T06:55:31Z in French language PUM, pp.143-178, 2022 Nicolas Perreaux LAMOP, CNRS http://arxiv.org/abs/2509.06212v1 Synergy, not size: How collaboration architecture shapes scientific disruption 2025-09-07T21:19:35Z

The mechanisms driving different types of scientific innovation through collaboration remain poorly understood. Here we develop a comprehensive framework analyzing over 14 million papers across 19 disciplines from 1960 to 2020 to unpack how collaborative synergy shapes research disruption. We introduce the synergy factor to quantify collaboration cost-benefit dynamics, revealing discipline-specific architectures where Physics peaks at medium team sizes while humanities achieve maximal synergy through individual scholarship. Our mediation analysis demonstrates that collaborative synergy, not team size alone, mediates 75% of the relationship between team composition and disruption. Key authors play a catalytic role, with papers featuring exceptional researchers showing 561% higher disruption indices. Surprisingly, high-citation authors reduce disruptive potential while those with breakthrough track records enhance it, challenging traditional evaluation metrics. We identify four distinct knowledge production modes: elite-driven, baseline, heterogeneity-driven, and low-cost. These findings reveal substantial heterogeneity in optimal collaboration strategies across disciplines and provide evidence-based guidance for research organization, with implications for science policy and the design of research institutions in an increasingly collaborative scientific landscape.

2025-09-07T21:19:35Z 29 pages, 4 figures Bili Zheng Jianhua Hou http://arxiv.org/abs/2509.06206v1 Beyond Productivity Gaps: Temporal Patterns of Gender Differences in Scientific Knowledge Creation 2025-09-07T21:03:22Z

Gender inequality in scientific careers has been extensively documented through aggregate measures such as total publications and cumulative citations, yet the temporal dynamics underlying these disparities remain largely unexplored. Here we developed a multi-dimensional framework to examine gender differences in scientific knowledge creation through three complementary temporal dimensions: stability (consistency of performance over time), volatility (degree of year-to-year fluctuation), and persistence (ability to maintain high performance for extended periods). Using comprehensive bibliometric data from SciSciNet covering 62.5 million authors whose careers began between 1960 and 2010, we constructed knowledge creation capability measures that captured how scientists absorb knowledge from diverse sources and contribute to field advancement. We found that female scientists demonstrated significantly higher knowledge production stability (0.170 vs. 0.119 for males) while simultaneously exhibiting greater year-to-year volatility (6.606 vs. 6.228), revealing a striking paradox in career dynamics. Female scientists showed persistence advantages under moderate performance requirements but faced disadvantages under extreme criteria demanding sustained peak performance. However, these patterns varied substantially across disciplines, with female advantages strongest in the humanities and social sciences, while STEM fields showed mixed results.

2025-09-07T21:03:22Z 23 pages, 5 figures Bili Zheng Chenyi Yang Jianhua Hou