http://arxiv.org/abs/2603.08935v2 PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration 2026-03-11T16:00:39Z Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible. Although institutions are rapidly digitizing pathology workflows, storing data without effective mechanisms for retrieval and reasoning risks transforming archives into a passive data repository, where institutional knowledge exists but cannot meaningfully inform patient care. True progress requires not only digitization, but the ability for pathologists to interrogate prior similar cases in real time while evaluating a new diagnostic dilemma. We present PathoScribe, a unified retrieval-augmented large language model (LLM) framework designed to transform static pathology archives into a searchable, reasoning-enabled living library. PathoScribe enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation within a single architecture. Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieved perfect Recall@10 for natural language case retrieval and demonstrated high-quality retrieval-grounded reasoning (mean reviewer score 4.56/5). Critically, the system operationalized automated cohort construction from free-text eligibility criteria, assembling research-ready cohorts in minutes (mean 9.2 minutes) with 91.3% agreement to human reviewers and no eligible cases incorrectly excluded, representing orders-of-magnitude reductions in time and cost compared to traditional manual chart review.
This work establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms. 2026-03-09T21:09:24Z Abdul Rehman Akbar Samuel Wales-McGrath Alejadro Levya Lina Gokhale Rajendra Singh Wei Chen Anil Parwani Muhammad Khalid Khan Niazi http://arxiv.org/abs/2603.10876v1 An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously? 2026-03-11T15:24:20Z Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work. 2026-03-11T15:24:20Z 9 pages, 5 figures. Accepted to appear in the Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) Jennifer D'Souza Sameer Sadruddin Maximilian Kähler Andrea Salfinger Luca Zaccagna Francesca Incitti Lauro Snidaro Osma Suominen http://arxiv.org/abs/2603.10285v1 Conversational AI-Enhanced Exploration System to Query Large-Scale Digitised Collections of Natural History Museums 2026-03-11T00:07:32Z Recent digitisation efforts in natural history museums have produced large volumes of collection data, yet their scale and scientific complexity often hinder public access and understanding. Conventional data management tools, such as databases, restrict exploration through keyword-based search or require specialised schema knowledge. 
This paper presents a system design that uses conversational AI to query nearly 1.7 million digitised specimen records from the life-science collections of the Australian Museum. Designed and developed through a human-centred design process, the system contains an interactive map for visual-spatial exploration and a natural-language conversational agent that retrieves detailed specimen data and answers collection-specific questions. The system leverages function-calling capabilities of contemporary large language models to dynamically retrieve structured data from external APIs, enabling fast, real-time interaction with extensive yet frequently updated datasets. Our work provides a new approach of connecting large museum collections with natural language-based queries and informs future designs of scientific AI agents for natural history museums. 2026-03-11T00:07:32Z 25 pages, 9 figures Yiyuan Wang Andrew Johnston Zoë Sadokierski Rhiannon Stephens Shane T. Ahyong http://arxiv.org/abs/2308.07162v3 Evolution of funding for collaborative health research towards higher-level patient-oriented research. A comparison of the European Union Framework Programmes to the program funding by the United States National Institutes of Health 2026-03-10T21:18:40Z Public research funding agencies increasingly seek to steer health research toward higher levels of translation and societal relevance. Yet it remains unclear to what extent such policy shifts are effectively implemented and reflected in funded projects and scientific outputs. This study examines evolution and changes in the orientation of health research portfolios since 2008 within European funding (Framework Programmes FP7 and Horizon 2020 funding for collaborative health research, FP-HR, and ERC Life Sciences grants), in comparison to NIH funding for collaborative research (P01, U01, and UM1). 
Using large-scale text analysis and supervised classification, we analyze both project descriptions and the associated scientific publications. At the project level, the EU FP-HR show pronounced shifts toward population-level, diagnostic, and health systems-oriented research, whereas investigator-driven ERC life sciences, NIH P01 and U01, display greater stability with a predominance of basic biomedical research. Publication-level analyses reveal more moderate changes, with basic biomedical research remaining a central component including in EU FP-HR, indicating partial translation of funding priorities into outputs. By jointly analyzing projects and publications, this study identifies and distinguishes between changes in funder expectations and realized research trajectories, highlighting how strategic funding shapes research portfolios within enduring epistemic and institutional constraints. 2023-08-14T14:17:34Z David Fajardo-Ortiz Bart Thijs Wolfgang Glanzel Karin R. Sipido http://arxiv.org/abs/2603.08012v1 Structure-Preserving Graph Contrastive Learning for Mathematical Information Retrieval 2026-03-09T06:36:34Z This paper introduces Variable Substitution as a domain-specific graph augmentation technique for graph contrastive learning (GCL) in the context of searching for mathematical formulas. Standard GCL augmentation techniques often distort the semantic meaning of mathematical formulas, particularly for small and highly structured graphs. Variable Substitution, on the other hand, preserves the core algebraic relationships and formula structure. To demonstrate the effectiveness of our technique, we apply it to a classic GCL-based retrieval model. Experiments show that this straightforward approach significantly improves retrieval performance compared to generic augmentation strategies. We release the code on GitHub (https://github.com/lazywulf/formula_ret_aug).
2026-03-09T06:36:34Z Chun-Hsi Ku Hung-Hsuan Chen http://arxiv.org/abs/2603.06839v1 From Job Postings to Curriculum Decisions: Using AI to Generate Workforce Intelligence for MSW Program Planning 2026-03-06T20:02:39Z Social work programs lack systematic methods to align curricula with employer expectations, typically relying on advisory input and alumni surveys rather than direct analysis of workforce requirements. This paper presents a case study demonstrating how one MSW program used artificial intelligence tools to generate organizational intelligence from job posting data for curriculum planning. Using a locally deployed language model, we classified over 40,000 job postings for MSW relevance and alignment with eight practice specializations, then extracted skills, therapeutic modalities, and technology competencies. Interpersonal Practice dominated the employment landscape, followed by Children, Youth, and Families. Clinical Assessment and Case Management emerged as cross-cutting competencies. Macro-level specializations showed co-occurrence patterns among partially aligned positions that largely disappeared among positions requiring MSW credentials specifically. Trauma-informed care appeared in management and evaluation roles, reflecting its expansion from clinical modality to organizational framework. The methodology demonstrates a transferable approach that other programs can adapt for strategic planning, and the findings illustrate the type of intelligence such analysis can yield. The patterns identified entered faculty deliberation as one input among many, interpreted by stakeholders with contextual knowledge no dataset can fully capture. 2026-03-06T20:02:39Z Barbara S. Hiltz Bryan G. Victor Brian E. 
Perron http://arxiv.org/abs/2603.06814v1 AI-Assisted Curation of Conference Scholarship: Compiling, Structuring, and Analyzing Two Decades of Presentations at the Society for Social Work and Research 2026-03-06T19:19:29Z Purpose: This study developed a comprehensive database of presentation abstracts from the Society for Social Work and Research (SSWR) Annual Conference and examined patterns in research methodology, authorship, collaboration, and institutional participation over two decades. Method: Abstract metadata was compiled from the SSWR Confex conference management system for presentations from 2005 to 2026 using web scraping. A small language model (gpt-oss:20b) performed classification and extraction tasks on abstracts, including categorization of methodologies and parsing of author affiliations, with human review at each major stage to ensure accuracy. Results: The database contains 23,793 presentations with 69,924 author records representing 20,779 unique researchers from 4,049 institutions across 93 countries. Annual conference presentations increased from 423 in 2005 to 1,935 in 2026, representing a compound annual growth rate of 7.5%. Quantitative methods predominated (61.1%), followed by qualitative approaches (23.4%), mixed methods (9.1%), and reviews (5.4%). The mean number of authors per presentation increased from 2.22 in 2005 to 3.31 in 2026. International participation grew from 4.5% to 13.5% of author affiliations over the observation period. Discussion: Findings indicate substantial growth in SSWR conference participation, alongside increased collaboration and international engagement. The methodological distribution reveals continued quantitative predominance with growing qualitative representation. This database provides research infrastructure for systematic hypothesis testing about research priorities and disciplinary development over time, enabling analyses that inform both scholarship and conference planning. 
2026-03-06T19:19:29Z Brian Perron Bryan Victor Zia Qi http://arxiv.org/abs/2603.06436v1 Rethinking Thematic Evolution in Science Mapping: An Integrated Framework for Longitudinal Analysis 2026-03-06T16:16:04Z Strategic diagrams and co-word analysis are widely employed to examine the conceptual structure of scientific domains and their development over time. Yet a structural inconsistency characterises dominant longitudinal implementations: themes are detected through relational clustering in weighted networks, whereas their inter-temporal connections are commonly inferred from set-theoretic overlap among keywords or core documents. This study introduces a structurally integrated framework in which lineage reconstruction is embedded within the same weighted relational architecture that underpins cross-sectional detection. The approach models thematic continuity through graded document affiliation and a lineage-strength measure that combines directional coverage with centrality-weighted structural relevance, thereby conceptualising evolution as the reconfiguration of relational structures rather than simple lexical persistence. By aligning thematic detection and temporal modelling within a unified relational paradigm, the framework enhances the methodological coherence and interpretive robustness of longitudinal science mapping. 2026-03-06T16:16:04Z Massimo Aria Luca D'Aniello Michelangelo Misuraca Maria Spano http://arxiv.org/abs/2404.01800v3 Sentiment Analysis of Citations in Scientific Articles Using ChatGPT: Identifying Potential Biases and Conflicts of Interest 2026-03-06T09:10:59Z Scientific articles play a crucial role in advancing knowledge and informing research directions. One key aspect of evaluating scientific articles is the analysis of citations, which provides insights into the impact and reception of the cited works. 
This article introduces the innovative use of large language models, particularly ChatGPT, for comprehensive sentiment analysis of citations within scientific articles. By leveraging advanced natural language processing (NLP) techniques, ChatGPT can discern the nuanced positivity or negativity of citations, offering insights into the reception and impact of cited works. Furthermore, ChatGPT's capabilities extend to detecting potential biases and conflicts of interest in citations, enhancing the objectivity and reliability of scientific literature evaluation. This study showcases the transformative potential of artificial intelligence (AI)-powered tools in enhancing citation analysis and promoting integrity in scholarly research. 2024-04-02T09:59:49Z Walid Hariri http://arxiv.org/abs/2603.05984v1 Fostering Knowledge Infrastructures in Science Communication and Aerospace Engineering 2026-03-06T07:31:01Z Knowledge infrastructures are defined as robust networks of people, artifacts, and institutions that generate, share and maintain specific knowledge. Yet, many domains are fragmented and far from robustly networked, such as science communication or aerospace engineering. While FAIR (Findable, Accessible, Interoperable, Reusable) data management tools exist, their adoption in these domains is limited. Several challenges inhibit this adoption, from complex heterogeneous data formats to lack of structured support to outright incentives against collaboration or legal barriers. This doctoral work outlines how to foster underdeveloped knowledge infrastructures with the use-cases of science communication and aerospace engineering. By analyzing these problems and identifying available solutions, tool-supported workflows towards collaborative infrastructure can be implemented and evaluated. 
These include human-in-the-loop artificial intelligence (AI)-supported workflows for information extraction and processing, wiki- and knowledge-graph-based digital libraries, and stakeholder-requirement-driven interfaces. While these developed tools for workflow automation and knowledge representation show promise, significant challenges remain. Future work will have to go beyond technical problem-solving and address the societal and legal barriers to unlock the particular domains. Beyond that, advocates of emerging knowledge infrastructures in any domain are welcome to apply the findings of this work to foster the networking of available knowledge. 2026-03-06T07:31:01Z 4 pages, 1 figure, accepted at JCDL 2025 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Tim Wittenborg 10.1109/JCDL67857.2025.00052 http://arxiv.org/abs/2603.05192v1 Aerospace.Wikibase: Towards a Knowledge Infrastructure for Aerospace Engineering 2026-03-05T14:03:48Z While Aerospace engineering can benefit greatly from collaborative knowledge management, its infrastructure is still fragmented. Bridging this divide is essential to reduce the current practice of redundant work and to address the challenges posed by the rapidly growing volume of aviation data. This study presents an accessible platform, built on Wikibase, to enable collaborative sharing and curation of aerospace engineering knowledge, initially populated with data from a recent systematic literature review. As a solid foundation, the Aerospace.Wikibase provides over 700 terms related to processes, software and data, openly available for future extension. Linking project-specific concepts to persistent, independent infrastructure enables aerospace engineers to collaborate on universal knowledge without risking the appropriation of project information, thereby promoting sustainable solutions to modern challenges while acknowledging the limitations of the industry. 
2026-03-05T14:03:48Z 4 pages, 1 figure, submitted to JCDL 2025 Tim Wittenborg Ildar Baimuratov Jamal Eldemashki http://arxiv.org/abs/2603.05177v1 SWARM-SLR AIssistant: A Unified Framework for Scalable Systematic Literature Review Automation 2026-03-05T13:44:45Z Despite a growing ecosystem of tools supporting Systematic Literature Reviews (SLRs), integrating them into user-friendly workflows remains challenging. The Streamlined Workflow for Automating Machine-Actionable Systematic Literature Reviews (SWARM-SLR) unified the tool annotation and provided a cohesive yet modular workflow, but faced scalability and usability issues. We introduce the SWARM-SLR AIssistant, a unified framework that combines the SWARM-SLR's structured methodology with an agent-based assistant that integrates research tools in a modular interface. The first SWARM-SLR stage is integrated, enabling conversational, LLM-guided support and persistent data storage. To address the tool assessment bottleneck, we propose a centralized tool registry that allows developers to annotate and register tools autonomously using a shared metadata schema. Preliminary evaluation shows improved usability, but challenges remain in balancing efficiency, accessibility, and transparency. Further development is needed to realize scalable SLR automation. 2026-03-05T13:44:45Z 4 pages, 3 figures, submitted to JCDL 2025 Tim Wittenborg Allard Oelen Manuel Prinz http://arxiv.org/abs/2602.01712v2 Mapping a Decade of Avian Influenza Research (2014-2023): A Scientometric Analysis from Web of Science 2026-03-05T07:09:02Z This scientometric study analyzes Avian Influenza research from 2014 to 2023 using bibliographic data from the Web of Science database. We examined publication trends, sources, authorship, collaborative networks, document types, and geographical distribution to gain insights into the global research landscape. 
Results reveal a steady increase in publications, with high contributions from Chinese and American institutions. Journals such as PLoS One and the Journal of Virology published the highest number of studies, indicating their influence in this field. The most prolific institutions include the Chinese Academy of Sciences and the University of Hong Kong, while the College of Veterinary Medicine at South China Agricultural University emerged as the most productive department. China and the USA lead in publication volume, though developed nations like the United Kingdom and Germany exhibit a higher rate of international collaboration. "Articles" are the most common document type, constituting 84.6% of the total, while "Reviews" account for 7.6%. This study provides a comprehensive view of global trends in Avian Influenza research, emphasizing the need for collaborative efforts across borders. 2026-02-02T06:37:20Z 24 pages, 7 figures, Research Article Journal of Health Information Research, 3(1), 1 - 24, 2026 Muneer Ahmad Undie Felicia Nkatv Amrita Sharma Gorrety Maria Juma Nicholas Kamoga Julirine Nakanwagi 10.47524/jhir.v3i1.25 http://arxiv.org/abs/2512.10268v2 Balancing the Byline: Exploring Gender and Authorship Patterns in Canadian Science Publishing Journals 2026-03-04T20:33:00Z Canada is internationally recognized for its leadership in science and its commitment to equity, diversity, and inclusion (EDI) in STEM (science, technology, engineering, and math) fields. Despite this leadership, limited research has examined gender disparities in scientific publishing within the Canadian context. This study analyzes over 67,000 articles published in 24 Canadian Science Publishing (CSP) journals between 2010 and 2021 to better understand patterns of gender representation. Findings show that women accounted for less than one-third of published authors across CSP journals. 
Representation varied by discipline, with higher proportions of women in biomedical sciences and lower proportions of women in engineering - trends that mirror broader national and global patterns. Notably, the proportion of women submitting manuscripts closely matched those published, suggesting that broader workforce disparities may play a larger role than publication bias. Women were less likely to be solo authors or to hold prominent authorship positions, such as first or last author - roles typically associated with research leadership and career advancement. These findings point to the need for a two-fold response: continued efforts to address systemic barriers to women's participation in science, and a review of publishing practices to ensure equitable access, recognition, and inclusion for all researchers. 2025-12-11T04:14:12Z Supplementary Information included Eden J. Hennessey Amanda Desnoyers Margaret Christ Adrianna Tassone Skye Hennessey Bianca Dreyer Alex Jay Patricia Sanchez Shohini Ghose http://arxiv.org/abs/2511.11953v2 National and state-level datasets of United States forensic DNA databases 2001-2025 2026-03-04T02:46:52Z Forensic DNA databases in the United States have expanded substantially over the past two decades. However, comprehensive, harmonized data describing database structure and composition remain limited. This dataset series documents forensic DNA infrastructure across national and state levels from 2001 to 2025. It includes a reconstructed time series of monthly National DNA Index System (NDIS) statistics from FBI archives, capturing counts of offender, arrestee, and forensic profiles, participating laboratory totals, and investigations aided. A complementary dataset compiles publicly available state-level statistics and policy metadata on arrestee collection laws, familial search practices, and DNA collection statutes across all 50 states. 
A third dataset provides standardized demographic and annual collection data obtained through previously published public records requests, including sex and racial composition where reported. Together, these resources provide a foundation for studying the historical development of forensic DNA systems in the U.S., enabling longitudinal and cross-sectional analyses of database growth, policy variation, and reporting practices across jurisdictions. 2025-11-15T00:01:02Z 12 pages, 7 figures Yemko Pryor Virum Ranka Joao Pedro Donadio Samantha C. Muller Jenna Wilson Tina Lasisi