http://arxiv.org/abs/2603.05984v1 Fostering Knowledge Infrastructures in Science Communication and Aerospace Engineering 2026-03-06T07:31:01Z

Knowledge infrastructures are defined as robust networks of people, artifacts, and institutions that generate, share and maintain specific knowledge. Yet, many domains are fragmented and far from robustly networked, such as science communication or aerospace engineering. While FAIR (Findable, Accessible, Interoperable, Reusable) data management tools exist, their adoption in these domains is limited. Several challenges inhibit this adoption, from complex heterogeneous data formats to lack of structured support to outright incentives against collaboration or legal barriers. This doctoral work outlines how to foster underdeveloped knowledge infrastructures with the use-cases of science communication and aerospace engineering. By analyzing these problems and identifying available solutions, tool-supported workflows towards collaborative infrastructure can be implemented and evaluated. These include human-in-the-loop artificial intelligence (AI)-supported workflows for information extraction and processing, wiki- and knowledge-graph-based digital libraries, and stakeholder-requirement-driven interfaces. While these developed tools for workflow automation and knowledge representation show promise, significant challenges remain. Future work will have to go beyond technical problem-solving and address the societal and legal barriers to unlock the particular domains. Beyond that, advocates of emerging knowledge infrastructures in any domain are welcome to apply the findings of this work to foster the networking of available knowledge.

2026-03-06T07:31:01Z 4 pages, 1 figure, accepted at JCDL 2025 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Tim Wittenborg 10.1109/JCDL67857.2025.00052 http://arxiv.org/abs/2603.05192v1 Aerospace.Wikibase: Towards a Knowledge Infrastructure for Aerospace Engineering 2026-03-05T14:03:48Z

While Aerospace engineering can benefit greatly from collaborative knowledge management, its infrastructure is still fragmented. Bridging this divide is essential to reduce the current practice of redundant work and to address the challenges posed by the rapidly growing volume of aviation data. This study presents an accessible platform, built on Wikibase, to enable collaborative sharing and curation of aerospace engineering knowledge, initially populated with data from a recent systematic literature review. As a solid foundation, the Aerospace.Wikibase provides over 700 terms related to processes, software and data, openly available for future extension. Linking project-specific concepts to persistent, independent infrastructure enables aerospace engineers to collaborate on universal knowledge without risking the appropriation of project information, thereby promoting sustainable solutions to modern challenges while acknowledging the limitations of the industry.

2026-03-05T14:03:48Z 4 pages, 1 figure, submitted to JCDL 2025 Tim Wittenborg Ildar Baimuratov Jamal Eldemashki http://arxiv.org/abs/2603.05177v1 SWARM-SLR AIssistant: A Unified Framework for Scalable Systematic Literature Review Automation 2026-03-05T13:44:45Z

Despite a growing ecosystem of tools supporting Systematic Literature Reviews (SLRs), integrating them into user-friendly workflows remains challenging. The Streamlined Workflow for Automating Machine-Actionable Systematic Literature Reviews (SWARM-SLR) unified the tool annotation and provided a cohesive yet modular workflow, but faced scalability and usability issues. We introduce the SWARM-SLR AIssistant, a unified framework that combines the SWARM-SLR's structured methodology with an agent-based assistant that integrates research tools in a modular interface. The first SWARM-SLR stage is integrated, enabling conversational, LLM-guided support and persistent data storage. To address the tool assessment bottleneck, we propose a centralized tool registry that allows developers to annotate and register tools autonomously using a shared metadata schema. Preliminary evaluation shows improved usability, but challenges remain in balancing efficiency, accessibility, and transparency. Further development is needed to realize scalable SLR automation.

2026-03-05T13:44:45Z 4 pages, 3 figures, submitted to JCDL 2025 Tim Wittenborg Allard Oelen Manuel Prinz http://arxiv.org/abs/2602.01712v2 Mapping a Decade of Avian Influenza Research (2014-2023): A Scientometric Analysis from Web of Science 2026-03-05T07:09:02Z

This scientometric study analyzes Avian Influenza research from 2014 to 2023 using bibliographic data from the Web of Science database. We examined publication trends, sources, authorship, collaborative networks, document types, and geographical distribution to gain insights into the global research landscape. Results reveal a steady increase in publications, with high contributions from Chinese and American institutions. Journals such as PLoS One and the Journal of Virology published the highest number of studies, indicating their influence in this field. The most prolific institutions include the Chinese Academy of Sciences and the University of Hong Kong, while the College of Veterinary Medicine at South China Agricultural University emerged as the most productive department. China and the USA lead in publication volume, though developed nations like the United Kingdom and Germany exhibit a higher rate of international collaboration. "Articles" are the most common document type, constituting 84.6% of the total, while "Reviews" account for 7.6%. This study provides a comprehensive view of global trends in Avian Influenza research, emphasizing the need for collaborative efforts across borders.

2026-02-02T06:37:20Z 24 pages, 7 figures, Research Article Journal of Health Information Research, 3(1), 1 - 24, 2026 Muneer Ahmad Undie Felicia Nkatv Amrita Sharma Gorrety Maria Juma Nicholas Kamoga Julirine Nakanwagi 10.47524/jhir.v3i1.25 http://arxiv.org/abs/2512.10268v2 Balancing the Byline: Exploring Gender and Authorship Patterns in Canadian Science Publishing Journals 2026-03-04T20:33:00Z

Canada is internationally recognized for its leadership in science and its commitment to equity, diversity, and inclusion (EDI) in STEM (science, technology, engineering, and math) fields. Despite this leadership, limited research has examined gender disparities in scientific publishing within the Canadian context. This study analyzes over 67,000 articles published in 24 Canadian Science Publishing (CSP) journals between 2010 and 2021 to better understand patterns of gender representation. Findings show that women accounted for less than one-third of published authors across CSP journals. Representation varied by discipline, with higher proportions of women in biomedical sciences and lower proportions of women in engineering - trends that mirror broader national and global patterns. Notably, the proportion of women submitting manuscripts closely matched those published, suggesting that broader workforce disparities may play a larger role than publication bias. Women were less likely to be solo authors or to hold prominent authorship positions, such as first or last author - roles typically associated with research leadership and career advancement. These findings point to the need for a two-fold response: continued efforts to address systemic barriers to women's participation in science, and a review of publishing practices to ensure equitable access, recognition, and inclusion for all researchers.

2025-12-11T04:14:12Z Supplementary Information included Eden J. Hennessey Amanda Desnoyers Margaret Christ Adrianna Tassone Skye Hennessey Bianca Dreyer Alex Jay Patricia Sanchez Shohini Ghose http://arxiv.org/abs/2511.11953v2 National and state-level datasets of United States forensic DNA databases 2001-2025 2026-03-04T02:46:52Z

Forensic DNA databases in the United States have expanded substantially over the past two decades. However, comprehensive, harmonized data describing database structure and composition remain limited. This dataset series documents forensic DNA infrastructure across national and state levels from 2001 to 2025. It includes a reconstructed time series of monthly National DNA Index System (NDIS) statistics from FBI archives, capturing counts of offender, arrestee, and forensic profiles, participating laboratory totals, and investigations aided. A complementary dataset compiles publicly available state-level statistics and policy metadata on arrestee collection laws, familial search practices, and DNA collection statutes across all 50 states. A third dataset provides standardized demographic and annual collection data obtained through previously published public records requests, including sex and racial composition where reported. Together, these resources provide a foundation for studying the historical development of forensic DNA systems in the U.S., enabling longitudinal and cross-sectional analyses of database growth, policy variation, and reporting practices across jurisdictions.

2025-11-15T00:01:02Z 12 pages, 7 figures Yemko Pryor Virum Ranka Joao Pedro Donadio Samantha C. Muller Jenna Wilson Tina Lasisi http://arxiv.org/abs/2603.00399v2 A Data-Driven Analysis for Engineering Conferences: The Institute of Industrial and Systems Engineering (IISE) Annual Conference Proceedings (2002-2025) 2026-03-03T22:40:07Z

Charting the intellectual evolution of a scientific discipline is crucial for identifying its core contributions, challenges, and future directions. The IISE Annual Conference proceedings offer a rich longitudinal archive of the Industrial and Systems Engineering (ISE) community's development, but the sheer volume of scholarship produced over two decades makes a holistic analysis difficult. Traditional reviews often fail to capture the full scale of thematic shifts and complex collaboration networks that define the community's growth. This paper presents a computational analysis of IISE proceedings from 2002 to 2025, drawing on an initial dataset of 9,350 titles from ProQuest for thematic analysis and 8,958 titles from Google Scholar for citation analysis, to deliver a cartography of the ISE field's intellectual history. Leveraging Large Language Models (LLMs) for domain-aware classification, Natural Language Processing, and Network Science, our study systematically maps thematic evolution to identify dominant, emerging, and receding research topics. We analyze citation data and co-authorship networks to uncover influential papers and authors, providing critical insights into knowledge diffusion and community structure. Through this comprehensive analysis, we establish a baseline for understanding the trajectory of ISE research and offer valuable insights for researchers, practitioners, and educators. The findings illuminate the field's intellectual assets and provide a data-informed map to guide the future of ISE. To foster reproducibility and further research, the curated dataset used in this study and the results will be made publicly available.
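The co-authorship analysis mentioned above can be illustrated with a minimal sketch. It assumes each paper is available as a plain list of author names; the abstract does not describe the authors' actual pipeline or tooling, so the function names and the toy author names below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(papers):
    """Count how often each pair of authors appears on the same paper.

    `papers` is an iterable of author-name lists (hypothetical input format).
    Returns a Counter mapping sorted (author_a, author_b) pairs to a weight.
    """
    edges = Counter()
    for authors in papers:
        # De-duplicate and sort so every pair has one canonical orientation.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum the edge weights per author, a simple proxy for collaboration volume."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Toy data with placeholder names; not taken from the IISE proceedings.
papers = [["Ada", "Ben", "Cy"], ["Ada", "Ben"], ["Cy"]]
edges = coauthor_edges(papers)
degree = weighted_degree(edges)
```

Single-author papers contribute no edges, so they do not affect the network, which is one reason such an analysis is usually reported alongside plain publication counts.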

2026-02-28T01:10:46Z 7 pages, 3 figures, IISE Annual Conference Proceedings 2026 H. Sinan Bank Casey E. Eaton http://arxiv.org/abs/2603.03457v1 Funders open access mandates: uneven uptake and challenging models 2026-03-03T19:15:02Z

Over the last two decades, research funders have adopted Open Access (OA) mandates in various forms and with varying success. While some funders emphasize gold OA through article processing charges, others favour green OA and repositories, leading to a fragmented policy landscape. Compliance with these mandates depends on several factors, including disciplinary field, monitoring, and availability of repository infrastructure. Based on 5 million papers supported by 36 funders from 20 countries, 11 million papers funded by other organisations, and 10 million papers without any funding reported, this study explores how different policies influence the adoption of OA. Findings indicate a sustained growth in OA overall, especially hybrid and gold OA, and that funded papers are more likely to be OA than unfunded papers. Those results suggest that policies such as Plan S, as well as read-and-publish agreements, have had a strong influence on OA adoption, especially among European funders. However, the global low uptake of Diamond OA and limited indexing of OA outputs in Latin American countries highlight ongoing disparities, influenced by funding constraints, journal visibility, and regional infrastructure challenges.

2026-03-03T19:15:02Z 17 pages (incl. supplementary materials) Lucía Céspedes Madelaine Hare Simon van Bellen Philippe Mongeon Vincent Larivière http://arxiv.org/abs/2603.03126v1 The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment 2026-03-03T15:58:18Z

Scholarly data are largely fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally-deployable infrastructure built on DuckDB and simple Parquet files that unifies eight open sources - Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref - via DOI normalization while preserving source-level schemas. The resource comprises approximately 960GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings covering 99.8% of topics ($\geq 0.65$ threshold) with $F1 = 0.77$ at the recommended $\geq 0.85$ operating point, outperforming TF-IDF, BM25, and Jaro-Winkler baselines on a 300-pair gold-standard evaluation. We validate through 10 automated checks, cross-source citation agreement analysis (pairwise Pearson $r = 0.76$ - $0.87$), and stratified manual annotation. Four vignettes demonstrate cross-source analyses infeasible with any single database. The resource is open source, deployable on a single drive or queryable remotely via HuggingFace, and includes structured documentation suitable for large language model (LLM) based research agents.
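The idea of linking sources via DOI normalization can be sketched as follows. This is a minimal illustration under stated assumptions, not the Science Data Lake's actual code: the prefix list, the validation pattern, the record layout (dicts with a `doi` key) and both function names are hypothetical. It relies only on the fact that DOIs are case-insensitive and are often written with a resolver URL or `doi:` prefix.

```python
import re
from typing import Optional

# Common ways a DOI is written in metadata (assumed list, not exhaustive).
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)
# Loose shape check: "10.", a 4-9 digit registrant code, a slash, a suffix.
_DOI_SHAPE = re.compile(r"10\.\d{4,9}/\S+")


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Return a lowercase, prefix-free DOI, or None if the value is not DOI-like."""
    if raw is None:
        return None
    doi = raw.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi if _DOI_SHAPE.fullmatch(doi) else None


def link_by_doi(left, right):
    """Pair records from two sources that share a normalized DOI.

    Records are returned untouched, so each source keeps its own schema.
    """
    index = {}
    for record in right:
        key = normalize_doi(record.get("doi"))
        if key is not None:
            index.setdefault(key, record)  # keep the first record per DOI
    pairs = []
    for record in left:
        key = normalize_doi(record.get("doi"))
        if key in index:
            pairs.append((record, index[key]))
    return pairs
```

Because the join key is computed on the fly and the records themselves are not rewritten, this mirrors the abstract's point that sources are unified while source-level schemas are preserved.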

2026-03-03T15:58:18Z 18 pages, 8 figures, 7 tables. Dataset DOI: 10.57967/hf/7850. Code: https://github.com/J0nasW/science-datalake Jonas Wilinski http://arxiv.org/abs/2603.00084v2 DeepXiv-SDK: An Agentic Data Interface for Scientific Literature 2026-03-03T07:41:40Z

LLM agents are increasingly used to accelerate the progress of scientific research. Yet a persistent bottleneck is data access: agents not only lack readily available tools for retrieval, but also have to work with unstructured, human-centric data on the Internet, such as HTML web pages and PDF files, leading to excessive token consumption, limited working efficiency, and brittle evidence look-up. This gap motivates the development of an agentic data interface, which is designed to enable agents to access and utilize scientific literature in a more effective, efficient, and cost-aware manner. In this paper, we introduce DeepXiv-SDK, which offers a three-layer agentic data interface for scientific literature. 1) Data Layer, which transforms unstructured, human-centric data into normalized and structured representations in JSON format, improving data usability and enabling progressive accessibility of the data. 2) Service Layer, which presents readily available tools for data access and ad-hoc retrieval. It also enables a rich form of agent usage, including CLI, MCP, and Python SDK. 3) Application Layer, which creates a built-in agent, packaging basic tools from the service layer to support complex data access demands. DeepXiv-SDK currently supports the complete arXiv corpus, and is synchronized daily to incorporate new releases. It is designed to extend to all common open-access corpora, such as PubMed Central, bioRxiv, medRxiv, and chemRxiv. We release RESTful APIs, an open-source Python SDK, and a web demo showcasing deep search and deep research workflows. DeepXiv-SDK is free to use with registration.

2026-02-14T23:07:28Z Project at https://github.com/DeepXiv/deepxiv_sdk Hongjin Qian Ziyi Xia Ze Liu Jianlyu Chen Kun Luo Minghao Qin Chaofan Li Lei Xiong Junwei Lan Sen Wang Zhengyang Liang Yingxia Shao Defu Lian Zheng Liu http://arxiv.org/abs/2603.01718v1 Changes in Manuscript Length, Research Team Size, and International Collaboration in the Post-2022 Period: Evidence from PLOS ONE 2026-03-02T10:42:23Z

Large language models (LLMs) have diffused rapidly into academic writing since late 2022. Using the complete population of 109,393 research articles published in PLOS ONE between 2019 and 2025, we examine population-level structural publication indicators, including full-text manuscript length, authorship team size, reference volume, and cross-linguistic collaboration, before and after 2022. PLOS ONE's multidisciplinary scope and consistent editorial framework allow cross-field comparison under uniform conditions over an extended period. Manuscript length increased substantially, with gains ranging from 14.8% among African-affiliated authors and 11.7% among Asian-affiliated authors to 5.3% among native English-speaking (NES) authors, cutting the word-count gap by 39%. More strikingly, non-native English-speaking (NNES) authors reduced both authorship team size, from 6.54 to 6.06 authors, or 7.3%, and collaboration with NES co-authors, from 17.8% to 12.2%, or 36%, while NES authors remained stable in both team size and collaboration rates. Reference counts increased modestly and uniformly across groups. These findings suggest that post-2022 tools may be reshaping not only how science is written, but who writes it together.

2026-03-02T10:42:23Z Yossi Ben-Zion Bar-Ilan University, Ramat Gan, Israel Eden Cohen Bar-Ilan University, Ramat Gan, Israel Nitza Davidovitch Ariel University, Ariel, Israel http://arxiv.org/abs/2603.01117v1 China leads scientific trends; the West launches new ones 2026-03-01T14:01:50Z

How nations shape the scientific frontier matters for technological competition, but standard metrics, including publication counts, citations, and disruption indices, look backward and fail to distinguish between fundamentally different leadership strategies. We develop and validate two forward-looking model-based measures and apply them to tens of millions of articles since 1990. The first embeds research pathways within an evolving hypergraph of concepts and scientists to identify leadership in emerging areas--work that anticipates where the scientific crowd is heading. The second embeds evolving samples of ideas and disciplines drawn upon in past research to identify leadership in surprising new directions as unexpected combinations become routine and science reorganizes around them. China became the global leader in emerging areas roughly a decade ago, well before it led in volume, reflecting a capacity to detect and amplify nascent consensus at scale. The United States and Europe show the opposite profile: declining emergence shares but persistent leadership in prescient work, especially research bridging disciplinary boundaries. These patterns replicate across databases, attribution methods, and strategic domains, including AI, biotechnology, energy, and semiconductors. Nations lead science by reading the landscape or by reshaping it, and the institutional requirements for each strategy lie in tension. The distribution of these strategies promises to shape the global structure of technological leadership for decades.

2026-03-01T14:01:50Z 16 pages, 4 figures Jeffrey W. Lockhart Jamshid Sourati Feng Shi James Evans http://arxiv.org/abs/2603.00807v1 Consensus and fragmentation in academic publication preferences 2026-02-28T21:03:44Z

Academic publishing requires solving a collective coordination problem: among thousands of possible publication venues, which deserve a community's attention? A clear consensus helps scholars allocate attention, match submissions to appropriate outlets, and evaluate scholars for hiring and promotion. Yet preferences are not centrally coordinated--they emerge within each field over time. Here we ask whether all fields have arrived at similar solutions to this coordination problem, and whether preferences vary systematically with individual characteristics. Using an adaptive survey of 3,510 US tenure-track faculty yielding 163,002 pairwise comparisons across 8,044 venues, we show that fields occupy a wide spectrum of coordination. Economics, Chemistry, and Physics exhibit strong consensus, with respondents agreeing on elite venues and accurately predicting one another's choices. Computer Science and Engineering show fragmented preferences distributed across hundreds of outlets with minimal overlap. Within fields, preferences correlate with institutional prestige--faculty at elite institutions prefer higher-ranked venues--and with gender, as men prefer higher-ranked venues than women even after accounting for prestige and career stage. Scholars realize their personal preferences more successfully than their respective fields' consensus preferences, indicating that heterogeneity, not just selective hierarchy, shapes publishing outcomes. Journal Impact Factors explain only 64% of preference choices, systematically undervaluing what fields actually prefer. These results quantify how publication preferences vary across the structural diversity of academic fields.
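One plausible reading of the statement that Journal Impact Factors "explain" a share of preference choices is the fraction of pairwise comparisons in which the respondent picked the venue with the higher impact factor. The sketch below illustrates that reading only; the abstract does not specify the authors' procedure, and the function name, the input layout, the venue names and the impact-factor values are all hypothetical.

```python
import math


def jif_agreement(comparisons, jif):
    """Share of pairwise choices that favor the higher-JIF venue.

    `comparisons` is an iterable of (chosen, rejected) venue names and `jif`
    maps venue name to impact factor (both hypothetical input formats).
    Pairs with a missing or tied impact factor are skipped, since the
    impact factor makes no prediction for them.
    """
    decided = 0
    agree = 0
    for chosen, rejected in comparisons:
        if chosen not in jif or rejected not in jif:
            continue
        if jif[chosen] == jif[rejected]:
            continue
        decided += 1
        if jif[chosen] > jif[rejected]:
            agree += 1
    return agree / decided if decided else math.nan


# Toy data: venue names and impact factors are placeholders.
toy_jif = {"Venue A": 5.0, "Venue B": 3.0, "Venue C": 1.0}
toy_choices = [
    ("Venue A", "Venue B"),  # agrees with JIF ordering
    ("Venue B", "Venue C"),  # agrees
    ("Venue C", "Venue A"),  # disagrees
    ("Venue A", "Venue C"),  # agrees
]
share = jif_agreement(toy_choices, toy_jif)
```

Under this reading, a value of 0.5 would correspond to chance for a binary choice, which gives a reference point for interpreting any reported share.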

2026-02-28T21:03:44Z 14 pages, 5 figures, followed by extensive supporting information appendices Ian Van Buskirk Marilena Hohmann Ekaterina Landgren Johan Ugander Aaron Clauset Daniel B. Larremore http://arxiv.org/abs/2602.24229v1 Science Fiction and Fantasy in Wikipedia: Exploring Structural and Semantic Cues 2026-02-27T17:56:25Z

Identifying which Wikipedia articles are related to science fiction, fantasy, or their hybrids is challenging because genre boundaries are porous and frequently overlap. Wikipedia nonetheless offers machine-readable structure beyond text, including categories, internal links (wikilinks), and statements of corresponding Wikidata items. However, each of these signals reflects community conventions and can be biased or incomplete. This study examines structural and semantic features of Wikipedia articles that can be used to identify content related to science fiction and fantasy (SF/F).

2026-02-27T17:56:25Z Supplementary materials: https://data.lewoniewski.info/fantasy/ Włodzimierz Lewoniewski Milena Stróżyna Izabela Czumałowska Elżbieta Lewańska http://arxiv.org/abs/2602.23941v1 EDDA-Coordinata: An Annotated Dataset of Historical Geographic Coordinates 2026-02-27T11:43:17Z

This paper introduces a dataset of enriched geographic coordinates retrieved from Diderot and d'Alembert's eighteenth-century Encyclopedie. Automatically recovering geographic coordinates from historical texts is a complex task, as they are expressed in a variety of ways and with varying levels of precision. To improve retrieval of coordinates from similar digitized early modern texts, we have created a gold standard dataset, trained models, published the resulting inferred and normalized coordinate data, and experimented with applying these models to new texts. From 74,000 total articles in each of the digitized versions of the Encyclopedie from ARTFL and ENCCRE, we examined 15,278 geographical entries, manually identifying 4,798 containing coordinates, and 10,480 with descriptive but non-numerical references. Leveraging our gold standard annotations, we trained transformer-based models to retrieve and normalize coordinates. The pipeline presented here combines a classifier to identify coordinate-bearing entries and a second model for retrieval, tested across encoder-decoder and decoder architectures. Cross-validation yielded an 86% EM score. On an out-of-domain eighteenth-century Trevoux dictionary (also in French), our fine-tuned model had a 61% EM score, while for the nineteenth-century, 7th edition of the Encyclopaedia Britannica in English, the EM was 77%. These findings highlight the gold standard dataset's usefulness as training data, and our two-step method's cross-lingual, cross-domain generalizability.
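The EM (exact match) scores reported above count a prediction as correct only if it equals the gold annotation as a whole string. A minimal sketch of such a metric follows; it assumes predictions and references are plain strings and that comparison ignores case and extra whitespace. The paper's actual normalization rules are not given in the abstract, so `_canonical` and the sample coordinate strings are hypothetical.

```python
def _canonical(text: str) -> str:
    """Collapse runs of whitespace and ignore case (assumed normalization)."""
    return " ".join(text.split()).casefold()


def exact_match(predictions, references) -> float:
    """Fraction of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        _canonical(pred) == _canonical(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# Toy strings; the coordinate format is a placeholder, not the dataset's.
predicted = ["48.85 N, 2.35 E", "45.0 N, 5.0 E"]
gold = ["48.85 N,  2.35 E", "45.0 N, 4.8 E"]
score = exact_match(predicted, gold)
```

Exact match gives no partial credit: the second toy prediction is off only in longitude yet scores zero, which is worth keeping in mind when comparing the in-domain and out-of-domain figures.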

2026-02-27T11:43:17Z Accepted at LREC 2026 Ludovic Moncla Pierre Nugues Thierry Joliveau Katherine McDonough