https://arxiv.org/api/HueOAWgaa331Ns8q5zwXffSmcUw
2026-03-22T08:57:47Z
58701515

http://arxiv.org/abs/2603.08935v2
PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration
2026-03-11T16:00:39Z
Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible. Although institutions are rapidly digitizing pathology workflows, storing data without effective mechanisms for retrieval and reasoning risks transforming archives into a passive data repository, where institutional knowledge exists but cannot meaningfully inform patient care. True progress requires not only digitization, but the ability for pathologists to interrogate prior similar cases in real time while evaluating a new diagnostic dilemma. We present PathoScribe, a unified retrieval-augmented large language model (LLM) framework designed to transform static pathology archives into a searchable, reasoning-enabled living library. PathoScribe enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation within a single architecture. Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieved perfect Recall@10 for natural language case retrieval and demonstrated high-quality retrieval-grounded reasoning (mean reviewer score 4.56/5). Critically, the system operationalized automated cohort construction from free-text eligibility criteria, assembling research-ready cohorts in minutes (mean 9.2 minutes) with 91.3% agreement to human reviewers and no eligible cases incorrectly excluded, representing orders-of-magnitude reductions in time and cost compared to traditional manual chart review.
This work establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms.
2026-03-09T21:09:24Z
Abdul Rehman Akbar, Samuel Wales-McGrath, Alejadro Levya, Lina Gokhale, Rajendra Singh, Wei Chen, Anil Parwani, Muhammad Khalid Khan Niazi

http://arxiv.org/abs/2603.10876v1
An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?
2026-03-11T15:24:20Z
Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.
2026-03-11T15:24:20Z
9 pages, 5 figures. Accepted to appear in the Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Jennifer D'Souza, Sameer Sadruddin, Maximilian Kähler, Andrea Salfinger, Luca Zaccagna, Francesca Incitti, Lauro Snidaro, Osma Suominen

http://arxiv.org/abs/2603.10285v1
Conversational AI-Enhanced Exploration System to Query Large-Scale Digitised Collections of Natural History Museums
2026-03-11T00:07:32Z
Recent digitisation efforts in natural history museums have produced large volumes of collection data, yet their scale and scientific complexity often hinder public access and understanding. Conventional data management tools, such as databases, restrict exploration through keyword-based search or require specialised schema knowledge.
This paper presents a system design that uses conversational AI to query nearly 1.7 million digitised specimen records from the life-science collections of the Australian Museum. Designed and developed through a human-centred design process, the system contains an interactive map for visual-spatial exploration and a natural-language conversational agent that retrieves detailed specimen data and answers collection-specific questions. The system leverages function-calling capabilities of contemporary large language models to dynamically retrieve structured data from external APIs, enabling fast, real-time interaction with extensive yet frequently updated datasets. Our work provides a new approach of connecting large museum collections with natural language-based queries and informs future designs of scientific AI agents for natural history museums.
2026-03-11T00:07:32Z
25 pages, 9 figures
Yiyuan Wang, Andrew Johnston, Zoë Sadokierski, Rhiannon Stephens, Shane T. Ahyong

http://arxiv.org/abs/2308.07162v3
Evolution of funding for collaborative health research towards higher-level patient-oriented research. A comparison of the European Union Framework Programmes to the program funding by the United States National Institutes of Health
2026-03-10T21:18:40Z
Public research funding agencies increasingly seek to steer health research toward higher levels of translation and societal relevance. Yet it remains unclear to what extent such policy shifts are effectively implemented and reflected in funded projects and scientific outputs. This study examines evolution and changes in the orientation of health research portfolios since 2008 within European funding (Framework Programmes FP7 and Horizon 2020 funding for collaborative health research, FP-HR, and ERC Life Sciences grants), in comparison to NIH funding for collaborative research (P01, U01, and UM1).
Using large-scale text analysis and supervised classification, we analyze both project descriptions and the associated scientific publications. At the project level, the EU FP-HR show pronounced shifts toward population-level, diagnostic, and health systems-oriented research, whereas investigator-driven ERC life sciences, NIH P01 and U01, display greater stability with a predominance of basic biomedical research. Publication-level analyses reveal more moderate changes, with basic biomedical research remaining a central component including in EU FP-HR, indicating partial translation of funding priorities into outputs. By jointly analyzing projects and publications, this study identifies and distinguishes between changes in funder expectations and realized research trajectories, highlighting how strategic funding shapes research portfolios within enduring epistemic and institutional constraints.
2023-08-14T14:17:34Z
David Fajardo-Ortiz, Bart Thijs, Wolfgang Glanzel, Karin R. Sipido

http://arxiv.org/abs/2603.08012v1
Structure-Preserving Graph Contrastive Learning for Mathematical Information Retrieval
2026-03-09T06:36:34Z
This paper introduces Variable Substitution as a domain-specific graph augmentation technique for graph contrastive learning (GCL) in the context of searching for mathematical formulas. Standard GCL augmentation techniques often distort the semantic meaning of mathematical formulas, particularly for small and highly structured graphs. Variable Substitution, on the other hand, preserves the core algebraic relationships and formula structure. To demonstrate the effectiveness of our technique, we apply it to a classic GCL-based retrieval model. Experiments show that this straightforward approach significantly improves retrieval performance compared to generic augmentation strategies.
We release the code on GitHub (https://github.com/lazywulf/formula_ret_aug).
2026-03-09T06:36:34Z
Chun-Hsi Ku, Hung-Hsuan Chen

http://arxiv.org/abs/2603.06839v1
From Job Postings to Curriculum Decisions: Using AI to Generate Workforce Intelligence for MSW Program Planning
2026-03-06T20:02:39Z
Social work programs lack systematic methods to align curricula with employer expectations, typically relying on advisory input and alumni surveys rather than direct analysis of workforce requirements. This paper presents a case study demonstrating how one MSW program used artificial intelligence tools to generate organizational intelligence from job posting data for curriculum planning. Using a locally deployed language model, we classified over 40,000 job postings for MSW relevance and alignment with eight practice specializations, then extracted skills, therapeutic modalities, and technology competencies. Interpersonal Practice dominated the employment landscape, followed by Children, Youth, and Families. Clinical Assessment and Case Management emerged as cross-cutting competencies. Macro-level specializations showed co-occurrence patterns among partially aligned positions that largely disappeared among positions requiring MSW credentials specifically. Trauma-informed care appeared in management and evaluation roles, reflecting its expansion from clinical modality to organizational framework. The methodology demonstrates a transferable approach that other programs can adapt for strategic planning, and the findings illustrate the type of intelligence such analysis can yield. The patterns identified entered faculty deliberation as one input among many, interpreted by stakeholders with contextual knowledge no dataset can fully capture.
2026-03-06T20:02:39Z
Barbara S. Hiltz, Bryan G. Victor, Brian E. Perron

http://arxiv.org/abs/2603.06814v1
AI-Assisted Curation of Conference Scholarship: Compiling, Structuring, and Analyzing Two Decades of Presentations at the Society for Social Work and Research
2026-03-06T19:19:29Z
Purpose: This study developed a comprehensive database of presentation abstracts from the Society for Social Work and Research (SSWR) Annual Conference and examined patterns in research methodology, authorship, collaboration, and institutional participation over two decades.
Method: Abstract metadata was compiled from the SSWR Confex conference management system for presentations from 2005 to 2026 using web scraping. A small language model (gpt-oss:20b) performed classification and extraction tasks on abstracts, including categorization of methodologies and parsing of author affiliations, with human review at each major stage to ensure accuracy.
Results: The database contains 23,793 presentations with 69,924 author records representing 20,779 unique researchers from 4,049 institutions across 93 countries. Annual conference presentations increased from 423 in 2005 to 1,935 in 2026, representing a compound annual growth rate of 7.5%. Quantitative methods predominated (61.1%), followed by qualitative approaches (23.4%), mixed methods (9.1%), and reviews (5.4%). The mean number of authors per presentation increased from 2.22 in 2005 to 3.31 in 2026. International participation grew from 4.5% to 13.5% of author affiliations over the observation period.
Discussion: Findings indicate substantial growth in SSWR conference participation, alongside increased collaboration and international engagement. The methodological distribution reveals continued quantitative predominance with growing qualitative representation. This database provides research infrastructure for systematic hypothesis testing about research priorities and disciplinary development over time, enabling analyses that inform both scholarship and conference planning.
2026-03-06T19:19:29Z
Brian Perron, Bryan Victor, Zia Qi

http://arxiv.org/abs/2603.06436v1
Rethinking Thematic Evolution in Science Mapping: An Integrated Framework for Longitudinal Analysis
2026-03-06T16:16:04Z
Strategic diagrams and co-word analysis are widely employed to examine the conceptual structure of scientific domains and their development over time. Yet a structural inconsistency characterises dominant longitudinal implementations: themes are detected through relational clustering in weighted networks, whereas their inter-temporal connections are commonly inferred from set-theoretic overlap among keywords or core documents. This study introduces a structurally integrated framework in which lineage reconstruction is embedded within the same weighted relational architecture that underpins cross-sectional detection. The approach models thematic continuity through graded document affiliation and a lineage-strength measure that combines directional coverage with centrality-weighted structural relevance, thereby conceptualising evolution as the reconfiguration of relational structures rather than simple lexical persistence.
By aligning thematic detection and temporal modelling within a unified relational paradigm, the framework enhances the methodological coherence and interpretive robustness of longitudinal science mapping.
2026-03-06T16:16:04Z
Massimo Aria, Luca D'Aniello, Michelangelo Misuraca, Maria Spano

http://arxiv.org/abs/2404.01800v3
Sentiment Analysis of Citations in Scientific Articles Using ChatGPT: Identifying Potential Biases and Conflicts of Interest
2026-03-06T09:10:59Z
Scientific articles play a crucial role in advancing knowledge and informing research directions. One key aspect of evaluating scientific articles is the analysis of citations, which provides insights into the impact and reception of the cited works. This article introduces the innovative use of large language models, particularly ChatGPT, for comprehensive sentiment analysis of citations within scientific articles. By leveraging advanced natural language processing (NLP) techniques, ChatGPT can discern the nuanced positivity or negativity of citations, offering insights into the reception and impact of cited works. Furthermore, ChatGPT's capabilities extend to detecting potential biases and conflicts of interest in citations, enhancing the objectivity and reliability of scientific literature evaluation. This study showcases the transformative potential of artificial intelligence (AI)-powered tools in enhancing citation analysis and promoting integrity in scholarly research.
2024-04-02T09:59:49Z
Walid Hariri

http://arxiv.org/abs/2603.05984v1
Fostering Knowledge Infrastructures in Science Communication and Aerospace Engineering
2026-03-06T07:31:01Z
Knowledge infrastructures are defined as robust networks of people, artifacts, and institutions that generate, share and maintain specific knowledge. Yet, many domains are fragmented and far from robustly networked, such as science communication or aerospace engineering.
While FAIR (Findable, Accessible, Interoperable, Reusable) data management tools exist, their adoption in these domains is limited. Several challenges inhibit this adoption, from complex heterogeneous data formats to lack of structured support to outright incentives against collaboration or legal barriers. This doctoral work outlines how to foster underdeveloped knowledge infrastructures with the use-cases of science communication and aerospace engineering. By analyzing these problems and identifying available solutions, tool-supported workflows towards collaborative infrastructure can be implemented and evaluated. These include human-in-the-loop artificial intelligence (AI)-supported workflows for information extraction and processing, wiki- and knowledge-graph-based digital libraries, and stakeholder-requirement-driven interfaces. While these developed tools for workflow automation and knowledge representation show promise, significant challenges remain. Future work will have to go beyond technical problem-solving and address the societal and legal barriers to unlock the particular domains. Beyond that, advocates of emerging knowledge infrastructures in any domain are welcome to apply the findings of this work to foster the networking of available knowledge.
2026-03-06T07:31:01Z
4 pages, 1 figure, accepted at JCDL 2025
2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
Tim Wittenborg
10.1109/JCDL67857.2025.00052

http://arxiv.org/abs/2603.05192v1
Aerospace.Wikibase: Towards a Knowledge Infrastructure for Aerospace Engineering
2026-03-05T14:03:48Z
While Aerospace engineering can benefit greatly from collaborative knowledge management, its infrastructure is still fragmented. Bridging this divide is essential to reduce the current practice of redundant work and to address the challenges posed by the rapidly growing volume of aviation data.
This study presents an accessible platform, built on Wikibase, to enable collaborative sharing and curation of aerospace engineering knowledge, initially populated with data from a recent systematic literature review. As a solid foundation, the Aerospace.Wikibase provides over 700 terms related to processes, software and data, openly available for future extension. Linking project-specific concepts to persistent, independent infrastructure enables aerospace engineers to collaborate on universal knowledge without risking the appropriation of project information, thereby promoting sustainable solutions to modern challenges while acknowledging the limitations of the industry.
2026-03-05T14:03:48Z
4 pages, 1 figure, submitted to JCDL 2025
Tim Wittenborg, Ildar Baimuratov, Jamal Eldemashki

http://arxiv.org/abs/2603.05177v1
SWARM-SLR AIssistant: A Unified Framework for Scalable Systematic Literature Review Automation
2026-03-05T13:44:45Z
Despite a growing ecosystem of tools supporting Systematic Literature Reviews (SLRs), integrating them into user-friendly workflows remains challenging. The Streamlined Workflow for Automating Machine-Actionable Systematic Literature Reviews (SWARM-SLR) unified the tool annotation and provided a cohesive yet modular workflow, but faced scalability and usability issues. We introduce the SWARM-SLR AIssistant, a unified framework that combines the SWARM-SLR's structured methodology with an agent-based assistant that integrates research tools in a modular interface. The first SWARM-SLR stage is integrated, enabling conversational, LLM-guided support and persistent data storage. To address the tool assessment bottleneck, we propose a centralized tool registry that allows developers to annotate and register tools autonomously using a shared metadata schema. Preliminary evaluation shows improved usability, but challenges remain in balancing efficiency, accessibility, and transparency.
Further development is needed to realize scalable SLR automation.
2026-03-05T13:44:45Z
4 pages, 3 figures, submitted to JCDL 2025
Tim Wittenborg, Allard Oelen, Manuel Prinz

http://arxiv.org/abs/2602.01712v2
Mapping a Decade of Avian Influenza Research (2014-2023): A Scientometric Analysis from Web of Science
2026-03-05T07:09:02Z
This scientometric study analyzes Avian Influenza research from 2014 to 2023 using bibliographic data from the Web of Science database. We examined publication trends, sources, authorship, collaborative networks, document types, and geographical distribution to gain insights into the global research landscape. Results reveal a steady increase in publications, with high contributions from Chinese and American institutions. Journals such as PLoS One and the Journal of Virology published the highest number of studies, indicating their influence in this field. The most prolific institutions include the Chinese Academy of Sciences and the University of Hong Kong, while the College of Veterinary Medicine at South China Agricultural University emerged as the most productive department. China and the USA lead in publication volume, though developed nations like the United Kingdom and Germany exhibit a higher rate of international collaboration. "Articles" are the most common document type, constituting 84.6% of the total, while "Reviews" account for 7.6%.
This study provides a comprehensive view of global trends in Avian Influenza research, emphasizing the need for collaborative efforts across borders.
2026-02-02T06:37:20Z
24 pages, 7 figures, Research Article
Journal of Health Information Research, 3(1), 1 - 24, 2026
Muneer Ahmad, Undie Felicia Nkatv, Amrita Sharma, Gorrety Maria Juma, Nicholas Kamoga, Julirine Nakanwagi
10.47524/jhir.v3i1.25

http://arxiv.org/abs/2512.10268v2
Balancing the Byline: Exploring Gender and Authorship Patterns in Canadian Science Publishing Journals
2026-03-04T20:33:00Z
Canada is internationally recognized for its leadership in science and its commitment to equity, diversity, and inclusion (EDI) in STEM (science, technology, engineering, and math) fields. Despite this leadership, limited research has examined gender disparities in scientific publishing within the Canadian context. This study analyzes over 67,000 articles published in 24 Canadian Science Publishing (CSP) journals between 2010 and 2021 to better understand patterns of gender representation. Findings show that women accounted for less than one-third of published authors across CSP journals. Representation varied by discipline, with higher proportions of women in biomedical sciences and lower proportions of women in engineering - trends that mirror broader national and global patterns. Notably, the proportion of women submitting manuscripts closely matched those published, suggesting that broader workforce disparities may play a larger role than publication bias. Women were less likely to be solo authors or to hold prominent authorship positions, such as first or last author - roles typically associated with research leadership and career advancement.
These findings point to the need for a two-fold response: continued efforts to address systemic barriers to women's participation in science, and a review of publishing practices to ensure equitable access, recognition, and inclusion for all researchers.
2025-12-11T04:14:12Z
Supplementary Information included
Eden J. Hennessey, Amanda Desnoyers, Margaret Christ, Adrianna Tassone, Skye Hennessey, Bianca Dreyer, Alex Jay, Patricia Sanchez, Shohini Ghose

http://arxiv.org/abs/2511.11953v2
National and state-level datasets of United States forensic DNA databases 2001-2025
2026-03-04T02:46:52Z
Forensic DNA databases in the United States have expanded substantially over the past two decades. However, comprehensive, harmonized data describing database structure and composition remain limited. This dataset series documents forensic DNA infrastructure across national and state levels from 2001 to 2025. It includes a reconstructed time series of monthly National DNA Index System (NDIS) statistics from FBI archives, capturing counts of offender, arrestee, and forensic profiles, participating laboratory totals, and investigations aided. A complementary dataset compiles publicly available state-level statistics and policy metadata on arrestee collection laws, familial search practices, and DNA collection statutes across all 50 states. A third dataset provides standardized demographic and annual collection data obtained through previously published public records requests, including sex and racial composition where reported. Together, these resources provide a foundation for studying the historical development of forensic DNA systems in the U.S., enabling longitudinal and cross-sectional analyses of database growth, policy variation, and reporting practices across jurisdictions.
2025-11-15T00:01:02Z
12 pages, 7 figures
Yemko Pryor, Virum Ranka, Joao Pedro Donadio, Samantha C. Muller, Jenna Wilson, Tina Lasisi