https://arxiv.org/api/D8SKN61ufHr79GNk48l9UQkhkJ4
2026-03-22T12:04:55Z
58704515

http://arxiv.org/abs/2602.22529v1
Generative Agents Navigating Digital Libraries
2026-02-26T02:08:39Z
In the rapidly evolving field of digital libraries, the development of large language models (LLMs) has opened up new possibilities for simulating user behavior. This innovation addresses the longstanding challenge in digital library research: the scarcity of publicly available datasets on user search patterns due to privacy concerns. In this context, we introduce Agent4DL, a user search behavior simulator specifically designed for digital library environments. Agent4DL generates realistic user profiles and dynamic search sessions that closely mimic actual search strategies, including querying, clicking, and stopping behaviors tailored to specific user profiles. Our simulator's accuracy in replicating real user interactions has been validated through comparisons with real user data. Notably, Agent4DL demonstrates competitive performance compared to existing user search simulators such as SimIIR 2.0, particularly in its ability to generate more diverse and context-aware user behaviors.
2026-02-26T02:08:39Z
Proceedings of the 26th International Conference on Asia-Pacific Digital Libraries, ICADL 2024
Saber Zerhoudi, Michael Granitzer
10.1007/978-981-96-0865-2_14

http://arxiv.org/abs/2505.06721v3
Behind the Byline: A Large-Scale Study of Scientific Author Contributions
2026-02-25T18:32:55Z
Understanding how co-authors distribute credit is critical for accurately assessing scholarly collaboration. In this study, we uncover the implicit structures within scientific teamwork by systematically analyzing author contributions across a large corpus of research publications. We introduce a computational framework designed to convert free-text contribution statements into 14 standardized CRediT categories, identifying clear and consistent positional patterns in task assignments.
By analyzing over 400,000 scientific articles from prominent sources such as PLOS One and Nature, we extracted and standardized more than 5.6 million author-task assignments corresponding to 1.58 million author mentions. Our analysis reveals substantial disparities in workload distribution. Notably, in small teams with three co-authors, the most engaged contributor performs over three times more tasks than the least engaged, a disparity that grows linearly with team size. This demonstrates a consistent pattern of central and peripheral roles within modern collaborative teams. Moreover, our analysis shows distinct positional biases in task allocation: technical responsibilities, such as software development and formal analysis, broadly fall to authors positioned earlier in the author list, whereas managerial tasks, including supervision and funding acquisition, increasingly concentrate among authors positioned toward the end. This gradient underscores a significant division of labor, where early-listed authors mainly undertake most hands-on activities. In contrast, senior authors mostly assume roles involving leadership and oversight. Our findings highlight the structured and hierarchical organization within scholarly collaborations, providing deeper insights into the specific roles and dynamics that govern academic teamwork.
2025-05-10T18:02:55Z
15 (include references and appendix sections) and 8 figures (and 1 in the appendix section)
Itai Assraf, Michael Fire

http://arxiv.org/abs/2602.21926v1
Bridging Through Absence: How Comeback Researchers Bridge Knowledge Gaps Through Structural Re-emergence
2026-02-25T14:04:03Z
Understanding the role of researchers who return to academia after prolonged inactivity, termed "comeback researchers", is crucial for developing inclusive models of scientific careers. This study investigates the structural and semantic behaviors of comeback researchers, focusing on their role in cross-disciplinary knowledge transfer and network reintegration.
Using the AMiner citation dataset, we analyze 113,637 early-career researchers and identify 1,425 comeback cases based on a three-year-or-longer publication gap followed by renewed activity. We find that comeback researchers cite 126% more distinct communities and exhibit 7.6% higher bridging scores compared to dropouts. They also demonstrate 74% higher gap entropy, reflecting more irregular yet strategically impactful publication trajectories. Predictive models trained on these bridging- and entropy-based features achieve a 97% ROC-AUC, far outperforming the 54% ROC-AUC of baseline models using traditional metrics like publication count and h-index. Finally, we substantiate these results via a multi-lens validation. These findings highlight the unique contributions of comeback researchers and offer data-driven tools for their early identification and institutional support.
2026-02-25T14:04:03Z
Preprint; 25 pages, 14 figures, 7 tables, Submitted to Scientometrics 2025
Somyajit Chakraborty, Angshuman Jana, Avijit Gayen

http://arxiv.org/abs/2602.22276v1
EmpiRE-Compass: A Neuro-Symbolic Dashboard for Sustainable and Dynamic Knowledge Exploration, Synthesis, and Reuse
2026-02-25T09:58:20Z
Software engineering (SE) and requirements engineering (RE) face a significant increase in secondary studies, particularly literature reviews (LRs), due to the ever-growing number of scientific publications. Generative artificial intelligence (GenAI) exacerbates this trend by producing LRs rapidly but often at the expense of quality, rigor, and transparency. At the same time, secondary studies often fail to share underlying data and artifacts, limiting replication and reuse. This paper introduces EmpiRE-Compass, a neuro-symbolic dashboard designed to lower barriers for accessing, replicating, and reusing LR data.
Its overarching goal is to demonstrate how LRs can become more sustainable by semantically structuring their underlying data in research knowledge graphs (RKGs) and by leveraging large language models (LLMs) for easy and dynamic access, replication, and reuse. Building on two RE use cases, we developed EmpiRE-Compass with a modular system design and workflows for curated and custom competency questions. The dashboard is freely available online, accompanied by a demonstration video. To manage operational costs, a limit of 25 requests per IP address per day applies to the default LLM (GPT-4o mini). All source code and documentation are released as an open-source project to foster reuse, adoption, and extension. EmpiRE-Compass provides three core capabilities: (1) Exploratory visual analytics for curated competency questions; (2) Neuro-symbolic synthesis for custom competency questions; and (3) Reusable knowledge with all queries, analyses, and results openly available. By unifying RKGs and LLMs in a neuro-symbolic dashboard, EmpiRE-Compass advances sustainable LRs in RE, SE, and beyond. It lowers technical barriers, fosters transparency and reproducibility, and enables collaborative, continuously updated, and reusable LRs.
2026-02-25T09:58:20Z
7 pages, 1 figure, Accepted at 32nd International Working Conference on Requirements Engineering: Foundations for Software Quality
Oliver Karras, Amirreza Alasti, Lena John, Sushant Aggarwal, Yücel Celik

http://arxiv.org/abs/2602.19711v1
A Three-stage Neuro-symbolic Recommendation Pipeline for Cultural Heritage Knowledge Graphs
2026-02-23T11:02:13Z
The growing volume of digital cultural heritage resources highlights the need for advanced recommendation methods capable of interpreting semantic relationships between heterogeneous data entities.
This paper presents a complete methodology for implementing a hybrid recommendation pipeline integrating knowledge-graph embeddings, approximate nearest-neighbour search, and SPARQL-driven semantic filtering. The work is evaluated on the JUHMP (Jagiellonian University Heritage Metadata Portal) knowledge graph developed within the CHExRISH project, which at the time of experimentation contained ≈3.2M RDF triples describing people, events, objects, and historical relations affiliated with the Jagiellonian University (Kraków, PL). We evaluate four embedding families (TransE, ComplEx, ConvE, CompGCN) and perform hyperparameter selection for ComplEx and HNSW. Then, we present and evaluate the final three-stage neuro-symbolic recommender. Despite sparse and heterogeneous metadata, the approach produces useful and explainable recommendations, which were also proven with expert evaluation.
2026-02-23T11:02:13Z
15 pages, 1 figure; submitted to ICCS 2026 conference
Krzysztof Kutt, Elżbieta Sroka, Oleksandra Ishchuk, Luiz do Valle Miranda

http://arxiv.org/abs/2602.19698v1
Iconographic Classification and Content-Based Recommendation for Digitized Artworks
2026-02-23T10:44:27Z
We present a proof-of-concept system that automates iconographic classification and content-based recommendation of digitized artworks using the Iconclass vocabulary and selected artificial intelligence methods. The prototype implements a four-stage workflow for classification and recommendation, which integrates YOLOv8 object detection with algorithmic mappings to Iconclass codes, rule-based inference for abstract meanings, and three complementary recommenders (hierarchical proximity, IDF-weighted overlap, and Jaccard similarity). Although more engineering is still needed, the evaluation demonstrates the potential of this solution: Iconclass-aware computer vision and recommendation methods can accelerate cataloging and enhance navigation in large heritage repositories.
The key insight is to let computer vision propose visible elements and to use symbolic structures (Iconclass hierarchy) to reach meaning.
2026-02-23T10:44:27Z
14 pages, 7 figures; submitted to ICCS 2026 conference
Krzysztof Kutt, Maciej Baczyński

http://arxiv.org/abs/2602.19197v1
How Ten Publishers Retract Research
2026-02-22T13:58:22Z
Retractions are the primary mechanism for correcting the scholarly record, yet publishers differ markedly in how they use them. We present a bibliometric analysis of 46,087 retractions across 10 major publishers using data from the Retraction Watch database (1997-2026), examining retraction rates, reasons, temporal trends, and geographic distributions, among other dimensions. Normalized retraction rates vary by two orders of magnitude, from Elsevier's 3.97 per 10,000 publications to Hindawi's 320.02. China-affiliated authors account for the largest share of retractions at every publisher. Retraction lags and reason profiles also vary widely across publishers. Among the ten publishers, ACM is an outlier in its retraction profile. ACM's normalized rate is mid-range (5.65), yet 98.3% of its 354 retractions are related to one incident. Seven of the ten most common global retraction reasons (including misconduct, plagiarism, and data concerns) are entirely absent from ACM's record. ACM's first retraction dates to 2020, despite a catalog dating to 1997. ACM self-describes its retraction threshold as "extremely high." We discuss this threshold in relation to the COPE retraction guidelines and the implications of ACM's non-public dark archive of removed works.
2026-02-22T13:58:22Z
43 pages, 7 figures, 13 tables
Jonas Oppenlaender

http://arxiv.org/abs/2602.19115v1
How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders
2026-02-22T10:12:20Z
In recent years, there has been a growing use of generative AI, and large language models (LLMs) in particular, to support both the assessment and generation of scientific work. Although some studies have shown that LLMs can, to a certain extent, evaluate research according to perceived quality, our understanding of the internal mechanisms that enable this capability remains limited. This paper presents the first study that investigates how LLMs encode the concept of scientific quality through relevant monosemantic features extracted using sparse autoencoders. We derive such features under different experimental settings and assess their ability to serve as predictors across three tasks related to research quality: predicting citation count, journal SJR, and journal h-index. The results indicate that LLMs encode features associated with multiple dimensions of scientific quality. In particular, we identify four recurring types of features that capture key aspects of how research quality is represented: 1) features reflecting research methodologies; 2) features related to publication type, with literature reviews typically exhibiting higher impact; 3) features associated with high-impact research fields and technologies; and 4) features corresponding to specific scientific jargons.
These findings represent an important step toward understanding how LLMs encapsulate concepts related to research quality.
2026-02-22T10:12:20Z
Presented at SESAME 2025: Smarter Extraction of ScholArly MEtadata using Knowledge Graphs and Language Models, @ JCDL 2025
Michael McCoubrey, Angelo Salatino, Francesco Osborne, Enrico Motta

http://arxiv.org/abs/2602.18935v1
Responsible Intelligence in Practice: A Fairness Audit of Open Large Language Models for Library Reference Services
2026-02-21T19:05:03Z
As libraries explore large language models (LLMs) as a scalable layer for reference services, a core fairness question follows: can LLM-based services support all patrons fairly, regardless of demographic identity? While LLMs offer great potential for broadening access to information assistance, they may also reproduce societal biases embedded in their training data, potentially undermining libraries' commitments to impartial service. In this chapter, we apply a systematic evaluation approach that combines diagnostic classification to detect systematic differences with linguistic analysis to interpret their sources. Across three widely used open models (Llama-3.1 8B, Gemma-2 9B, and Ministral 8B), we find no compelling evidence of systematic differentiation by race/ethnicity, and only minor evidence of sex-linked differentiation in one model.
We discuss implications for responsible AI adoption in libraries and the importance of ongoing monitoring in aligning LLM-based services with core professional values.
2026-02-21T19:05:03Z
Invited chapter for the edited volume Artificial Intelligence and Social Justice Intersections in Library and Information Studies: Challenges and Opportunities (Emerald Group Publishing, in preparation)
Haining Wang, Jason Clark, Angelica Peña

http://arxiv.org/abs/2602.18264v1
A Curated Literature Database for Monitoring More Than 30 Years of Ansys Granta Product Usage
2026-02-20T14:47:27Z
Engineering and materials software is increasingly difficult to track in the scholarly and technical literature because publication volume is growing rapidly and software citation practices remain inconsistent. This is particularly true for the Ansys Granta product family, which is used for materials education, materials and process selection, sustainability-driven design, and enterprise materials information management. We present a structured and reproducible framework to consolidate evidence of operational Granta usage and to support quantitative monitoring of adoption patterns, application domains, and technical impact. The framework is implemented as a curated reference database in Ansys Granta MI Enterprise: bibliographic metadata are ingested semi-automatically (e.g., via DOI and citation-file parsing) and complemented by expert curation of usage descriptors (product, context, application domain, and technical depth), with relational links to authors and institutions. Downstream analytics are performed with Python, dashboards, and bibliometric/network visualization tools to enable reproducible querying and reporting.
As of September 2025, the database contains more than 1,100 curated records spanning journals, conferences, theses, books, patents, standards, and reports, and supports rapid retrieval of validated case studies, reproducible literature reviews, and technology scouting. Example analyses highlight dominant domains, key institutions, and recurring integrations with CAD/CAE/FEM environments. Overall, the approach converts heterogeneous software-usage evidence into structured, analyzable knowledge to improve visibility of engineering software impact and to support evidence-based assessment and strategic decision-making.
2026-02-20T14:47:27Z
David Mercier

http://arxiv.org/abs/2602.21249v1
Quality of Descriptive Information on Cultural Heritage Objects: Definition and Empirical Evaluation
2026-02-19T10:12:26Z
Effective data processing depends on the quality of the underlying data. However, quality issues, such as inconsistencies and uncertainties, can significantly impede the processing and subsequent use of data. Despite the centrality of data quality to a wide range of computational tasks, there is currently no broadly accepted, domain-independent consensus on the definition of data quality. Existing frameworks primarily define data quality in ways that are tailored to specific domains, data types, or contexts of use. Although quality assessment frameworks exist for specific domains, such as electronic health record data and linked data, corresponding approaches for descriptive information about cultural heritage objects remain underdeveloped. Moreover, existing quality definitions are often theoretical in nature and lack empirical validation based on real-world data problems. In this paper, we address these limitations by first defining a set of quality dimensions specifically designed to capture the characteristics of descriptive information about cultural heritage objects.
Our definition is based on an in-depth analysis of existing dimensions and is illustrated through domain-specific examples. We then evaluate the practical applicability of our proposed quality definition using a curated set of real-world data quality problems from the cultural heritage domain. This empirical evaluation substantiates our definition of data quality, resulting in a comprehensive definition of data quality in this domain.
2026-02-19T10:12:26Z
preprint
Markus Matoni, Arno Kesper, Gabriele Taentzer

http://arxiv.org/abs/2603.00107v1
SciKGDash: The Scientific Knowledge Graph Dashboard for Supporting Knowledge Curation
2026-02-18T10:37:32Z
Research knowledge graphs (RKGs) have emerged as essential technology for organizing scientific knowledge, but their success depends heavily on the quality of their underlying content. Knowledge curation is a critical task to ensure the quality of (research) knowledge graphs ((R)KGs), with human curation being the gold standard despite its time- and resource-intensive nature. Automated methods, while efficient, lack the precision of human expertise. Hybrid approaches, combining automated processes with human oversight, offer a promising solution to this challenge. Dashboards can act as supportive tools in hybrid curation approaches, offering real-time updates and visual overviews. This paper presents an action research study, conducted in collaboration with the Curation and Community Building (C&CB) team of the Open Research Knowledge Graph (ORKG), to explore the development of a dashboard, called SciKGDash, designed to support knowledge curation of the ORKG. SciKGDash serves as a minimum viable product (MVP) tailored to the needs of the C&CB team, with potential for adaptation to other (R)KGs. An experiment with 15 participants demonstrated the usability of SciKGDash, with successful completion of 4 out of 5 curation tasks in under 5 minutes. In addition, SciKGDash received a positive user experience rating (UEQ score of 1.93).
While the tailored solution proved effective for the ORKG, the research also highlights limitations in applying specific quality metrics across diverse (R)KGs. Future work should focus on identifying common quality metrics and enhancing SciKGDash with user-friendly features for querying customized quality metrics. Overall, knowledge curation in RKGs remains an under-explored field, warranting further research.
2026-02-18T10:37:32Z
2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2025, pp. 187-196
Lena John, Sören Auer, Oliver Karras
10.1109/JCDL67857.2025.00030

http://arxiv.org/abs/2510.22389v2
Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?
2026-02-17T13:09:07Z
Previous research has shown that journal article quality ratings from the cloud-based Large Language Model (LLM) families ChatGPT and Gemini and the medium-sized open-weights LLM Gemma3 27b correlate moderately with expert research quality scores. This article assesses whether other medium-sized LLMs, smaller LLMs, and reasoning models have similar abilities. This is tested with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1 on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one novel. Few-shot and score averaging approaches are also evaluated. The results suggest that medium-sized LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Reasoning models did not have a clear advantage. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and there is weak evidence that few-shot prompts (four examples) tend to help.
Overall, the results show, for the first time, that smaller LLMs >4b have a substantial capability to rate journal articles for research quality, especially if score averaging is used, but that reasoning does not give an advantage for this task; it is therefore not recommended because it is slow. The use of LLMs to support research evaluation is now more credible since multiple variants have a similar ability, including many that can be deployed offline in a secure environment without substantial computing resources.
2025-10-25T18:12:41Z
Thelwall, M. & Mohammadi, E. (2026). Can small and reasoning Large Language Models score journal articles for research quality and do averaging and few-shot help? Scientometrics
Mike Thelwall, Ehsan Mohammadi

http://arxiv.org/abs/2510.01783v2
PreprintToPaper dataset: connecting bioRxiv preprints with journal publications
2026-02-17T12:51:57Z
The PreprintToPaper dataset connects bioRxiv preprints with their corresponding journal publications, enabling large-scale analysis of the preprint-to-publication process. It comprises metadata for 145,517 preprints from two periods, 2016-2018 (pre-pandemic) and 2020-2022 (pandemic), retrieved via the bioRxiv and Crossref APIs. We selected the two periods to capture preprint-publication dynamics before and during the COVID-19 pandemic while avoiding transitional years. Each record includes bibliographic information such as titles, abstracts, authors, institutions, submission dates, licenses, and subject categories, alongside enriched publication metadata including journal names, publication dates, author lists, and further information. In addition to the main dataset, a version-history subset provides all available versions of preprints within the two selected periods, enabling analysis of how preprints evolve over time.
Preprints are categorized into three groups: Published (formally linked to a journal article), Preprint Only (posted on a preprint server), and Gray Zone (potentially published in a journal but unlinked). To enhance reliability, title and author similarity scores were computed, and a human-annotated subset of 299 records was created to evaluate Gray Zone cases. The dataset supports diverse applications, including studies of scholarly communication, open science policies, bibliometric tool development, and natural language processing research on textual changes between preprints and the corresponding journal articles. The dataset is publicly available in CSV format via Zenodo.
2025-10-02T08:21:50Z
13 pages, 3 figures, dataset paper
Scientific Data (2026)
Fidan Badalova, Julian Sienkiewicz, Philipp Mayr
10.1038/s41597-026-06867-3

http://arxiv.org/abs/2602.15413v1
StatCounter: A Longitudinal Study of a Portable Scholarly Metric Display
2026-02-17T08:13:55Z
This study explores a handheld, battery-operated e-ink device displaying Google Scholar citation statistics. The StatCounter places academic metrics into the flow of daily life rather than a desktop context. The work draws on a first-person, longitudinal auto-ethnographic inquiry examining how constant access to scholarly metrics influences motivation, attention, reflection, and emotional responses across work and non-work settings. The ambient proximity and pervasive availability of scholarly metrics invite frequent micro-checks and short reflective pauses, but also introduce moments of second-guessing when numbers drop or stagnate. Carrying the device prompts new narratives about academic identity, including a sense of companionship during travel and periods away from the office. Over time, the presence of the device turns metrics from an occasional reference into an ambient background of scholarly life.
The study contributes insight into how situated, embodied access to academic metrics reshapes their meaning, and frames opportunities for designing tools that engage with scholarly evaluation in reflective ways.
2026-02-17T08:13:55Z
Published in the proceedings of 10th ACM International Symposium on Pervasive Displays (PerDis '26)
Jonas Oppenlaender
10.1145/3797993.3798009