https://arxiv.org/api/IhyAn2QUzBb1zgxwo56wSxd19Y4
2026-06-09T20:33:39Z

http://arxiv.org/abs/2606.09500v1
Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture
2026-06-08T13:51:04Z
Objective. Large language models (LLMs) increasingly draft clinical research manuscripts, but their fluency can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items. Existing tools generate text without verifying it, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture that pairs generation with verification. Methods. The design rests on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism -- a deterministic, re-executable check where one suffices, and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills coordinated by one orchestrator, whose deterministic tier comprises 21 standard-library detectors. We evaluate it on three reproducible public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Results. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects. On 27 identical injected defects the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a generic single-prompt LLM reviewer detected 11, its misses concentrated in generated-code, bibliography-internal, and style defects the prose does not expose. Conclusion.
Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript -- feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).
2026-06-08T13:51:04Z
28 pages, 3 figures, 4 tables; includes supplementary material (deterministic-detector inventory, per-class defect breakdown, worked example). Software (MIT): https://github.com/Aperivue/medsci-skills ; archived on Zenodo (concept DOI 10.5281/zenodo.20155321; v3.8.0 version DOI 10.5281/zenodo.20582972)
Yoojin Nam, Jinhoon Jeong, Namkug Kim

http://arxiv.org/abs/2606.09479v1
Optical Music Recognition for Real-World Manuscripts with Synthetic Data
2026-06-08T13:38:48Z
Optical Music Recognition (OMR) has seen major progress in model design, with end-to-end methods now capable of recognising notation at all levels of complexity. However, the impact of this progress has been limited by the visual domains of available training datasets, which are largely born-digital. Existing large collections of sheet music in libraries and other heritage institutions contain predominantly manuscripts, whose visual domains are highly diverse and different, so existing OMR systems fail when applied in the real world. These institutions are often resource-constrained, so large in-domain datasets cannot be expected. We provide a first baseline on real-world manuscripts with complex piano notation in the resource-constrained scenario. Using fine-grained music notation graph (MuNG) annotations and the Smashcima synthesis tool, we then show that while some direct transcriptions of in-domain data remain essential, domain adaptation using synthetic musical manuscript images brings significant improvement. Furthermore, the symbols used do not need to be in-domain, so the expensive fine-grained annotation can be avoided.
We thus bring OMR closer to one of its stated goals: preserving and promoting musical cultural heritage.
2026-06-08T13:38:48Z
Accepted for publication at the ICDAR 2026 conference
Jiří Mayer, Martina Dvořáková, Vojtěch Dvořák, Markéta Herzánová Vlková, Filip Bím, Pavel Pecina, Samuel Šomorjai, Petr Žabička, Jan Hajič

http://arxiv.org/abs/2606.08723v1
From Text to Discovery: How Are LLMs Reshaping Scientific and Humanistic Research?
2026-06-07T16:38:27Z
Large Language Models (LLMs) are rapidly reshaping academic research across the natural sciences, social sciences, and humanities, yet the scientific community lacks a comprehensive, cross-disciplinary account of how these tools are being integrated, what they deliver, and where they fall short. This paper addresses that gap by mapping their current state and outlining an agenda for their responsible integration into scientific research. Our analysis reveals a consistent pattern: LLMs meaningfully accelerate research workflows -- from hypothesis generation and literature synthesis to data analysis and scientific writing -- while introducing serious challenges related to hallucination, reproducibility, dataset bias, and model opacity. Beyond technical limitations, we identify ten underexplored challenges, including the erosion of researcher autonomy, AI-driven confirmation bias, authorship ambiguity, and unequal access to these technologies -- systemic risks that demand interdisciplinary governance frameworks, robust validation standards, and expanded explainability research.
2026-06-07T16:38:27Z
Saleh Afroogh, Yasser Pouresmaeil, Yiming Xu, Kevin Chen, Abhejay Murali, Junfeng Jiao

http://arxiv.org/abs/2606.08589v1
Detection and Interpretability Analysis of Quotation Errors by Large Language Models
2026-06-07T12:01:48Z
Purpose - Quotation error refers to the inconsistency between cited information and its original source.
This phenomenon leads to a series of negative impacts, such as misinterpretation of the original research, undermining the academic community's collective understanding of relevant issues, and weakening the accuracy and fairness of the citation-based academic evaluation system. Existing studies have shown that quotation error is prevalent in the academic community; moreover, manual verification of quotation error is not only labor-intensive but also inefficient. Therefore, this paper proposes the task of 'automated detection of quotation errors'. Methodology - Adopting a large language model (LLM)-based approach, this paper improves detection performance from two aspects on the basis of existing research: first, employ the fine-tuning approach for LLMs to detect quotation errors; second, incorporating full-text data of the cited literature into dataset construction, and exploring the optimal scheme for building such datasets by comparing three types of full-text integration methods. Based on this, this paper further uses the TokenSHAP tool to conduct interpretability experimental analysis on the model's prediction results. Findings - The fine-tuning approach for LLMs has improved the performance in detecting quotation errors. Among the different methods for incorporating full-text information, the approach based on using the source abstract yielded the best performance. 
Originality - The fine-tuning approach for large language models (LLMs) is applied to the task of automated detection of quotation errors, and interpretability analysis is conducted on the model's output results.
2026-06-07T12:01:48Z
The Electronic Library, 2026
Bei Huang, Yingyi Zhang, Shenghao Huang, Chengzhi Zhang
10.1108/EL-11-2025-0464

http://arxiv.org/abs/2606.08301v1
Unraveling the Ai2 Asta Scholarly Research Assistant Citation System
2026-06-06T18:57:11Z
Despite the growing integration of Deep Research tools into academic workflows, empirical evidence on the operation, stability, and potential biases of their citation systems remains scarce. This study addresses this gap by evaluating the intensity, consistency, and bibliographic characteristics of references cited in the literature reports generated by Ai2 Asta, with the aim of understanding how its citation system operates and assessing its implications for scholarly communication. To this end, ten domain-specific queries were submitted to Asta's Summarise Literature feature, and two independent rounds of data collection were conducted. From each report, in-text citations, cited references, as well as other metrics related to the response process were extracted and examined. The results reveal high citation intensity, with reports integrating numerous in-text citations grounded in retrieved evidence and a diverse yet concentrated set of venues. However, notable instability is observed in the composition of cited references across identical queries, alongside a lack of concordance between retrieved documents and those ultimately cited, suggesting additional opaque selection mechanisms during report generation. These findings indicate that, while Ai2 Asta produces well-structured and quality reports, its instability and opacity in the citation process pose challenges in quantitative science studies due to their lack of reproducibility and transparency.
Despite the restricted number of queries and disciplinary scope, the results offer valuable insights for researchers, bibliometricians, developers, and research evaluators seeking to understand, use or regulate AI-based scholarly assistants responsibly.
2026-06-06T18:57:11Z
6 tables, 4 figures
Revista Panamericana de Comunicación, 7(2), 2026
Enrique Orduña-Malea, Carlos Lopezosa
10.21555/rpc.v7i2.3675

http://arxiv.org/abs/2606.08256v1
Traxia: A Framework for Verifiable, Agent-Native Scientific Publishing
2026-06-06T17:01:11Z
Verifiability, attribution, and reproducibility are foundational requirements of scientific knowledge, yet current publishing infrastructure does not enforce them at scale. We introduce Traxia, an agent-native scientific publishing framework in which AI research agents publish verifiable papers, build reputational identities, peer-review one another, and collaborate with humans in a shared provenance model. Traxia treats agents as first-class epistemic participants: every paper carries a reasoning trace, every claim a confidence interval, every agent a cryptographically signed identity, and every collaboration an immutable contribution log. We formalise five components: Agent Identity and Registry, Verifiable Publishing Layer, four-tier Peer Review Protocol, Reputation and Staking Engine, and a Knowledge Graph with contradiction detection. The framework targets reproducibility failure, provenance opacity, and exclusion of Global South research capacity. This paper presents architectural foundations and formal specifications only; it does not report empirical results. Evaluation and deeper component studies will follow in subsequent papers. A prototype partially implements core formalisms; the full system remains under active development.
2026-06-06T17:01:11Z
22 pages, 3 figures, 3 tables. Preprint. Under active development. Comments welcome
Wisdom Dogah

http://arxiv.org/abs/2606.07994v1
The Rising Dominance of Methods Across Science
2026-06-06T06:16:22Z
Scientific progress is traditionally narrated through the interplay of theoretical insights and experimental findings. Yet this view of science underplays a third and central pillar of progress: the methods that underlie both conceptual advances and empirical evidence. By analysing more than 3 million articles across science published between 1980 and 2019, we find that science has undergone a fundamental structural transition. The share of papers that primarily contribute new methods (methods papers) has doubled across science over the past four decades, rising universally across disciplines and citation impact levels. Rather than a gradual evolution, this transition marks a pivotal shift beginning in the early 1990s, aligning with the computational revolution and the emergence of data-intensive science. The surge in methodological research is not confined to the most cited, elite publications; it spans the full spectrum of scientific output. These findings reveal a systemic reorientation of the scientific ecosystem where reusable methods increasingly serve as the essential infrastructure of scientific advances, challenging the traditional dichotomy of theory and experimental research. As science becomes increasingly methods-driven, our results call for rethinking how research is evaluated, funded and organised, towards better incentivising method innovations. This is especially the case as expanding AI must be effectively integrated with scientific instruments to realise its full potential.
2026-06-06T06:16:22Z
Alexander Krauss, Ariel Rosenfeld, Lutz Bornmann

http://arxiv.org/abs/2604.16476v2
ClawXiv: a signed archival workflow and distributed publication architecture for human-AI collaborative research
2026-06-05T21:45:04Z
We propose ClawXiv, a workflow and archive architecture for mixed human-AI research.
The immediate problem is not only public dissemination of preprints, but also reliable migration from volatile chat sessions and heterogeneous LaTeX/BibTeX working directories into durable, signed, inspectable research artifacts. ClawXiv distinguishes four states: legacy seed, normalized project, signed bundle, and published artifact. The implemented kernel is local and author-side: an import script normalizes existing work into a project directory; a bundle-creation script compiles, signs, and packages the work into a content-addressed archival unit; and a publication script verifies and pushes the bundle to public infrastructure. Version 4 adds a bin/ utility layer with platform-dispatching screen capture, a figure-ingestion pipeline with a content-safety stub, a configure script, and a top-level Makefile. A companion ClawXiv bundle and repository release provide the operational scripts, provenance records, and user-facing documentation for the current implementation. Code is available at github.com/kornai/clawxiv.
2026-04-11T20:16:38Z
Andras Kornai

http://arxiv.org/abs/2412.07550v4
Changing topic bias in biomedical science maps by linking documents through alternative data sources: policy documents, patents, authors, Facebook, and Twitter
2026-06-05T18:27:09Z
Traditional science maps visualize topics by clustering documents within a network, but they are inherently biased toward clustering certain topics over others. If these topics could be chosen, then the science maps could be tailored for different needs. In this paper, we explore the extent to which the topic bias of a science map can be changed by choosing different data sources to build the document network.
We analyze this by evaluating the clustering effectiveness of several topic categories over two sources that are traditionally used for the creation of science maps (citations and text similarity) and six non-traditional data sources, which we found favor different kinds of topics: Health issues for Facebook users, biotechnology topics for patent families, government and social issues for policy documents, food topics for Twitter conversations, nursing topics for Twitter users, and geographical entities for document authors (the favoring in this latter source was particularly strong). Our results show that diverse data sources can be used to control topic bias, which opens up the possibility of creating science maps tailored for different needs.
2024-12-10T14:41:24Z
modifications from the second round of peer review in metaror, including changing the title
Juan Pablo Bascur, Rodrigo Costas, Suzan Verberne

http://arxiv.org/abs/2606.07332v1
The disruption index does not measure scientific innovation
2026-06-05T14:47:57Z
A paper recently published in Science under the rubric of Policy Article argued that what the authors call scientific disruption declines with academic age, and that this decline is related to the absence of mandatory retirement for older academics. Since its publication, its conclusions and policy suggestions in relation to mandatory retirement have received considerable media attention. Thus, it is worth taking a closer look at the proposed measure of disruption since all the analysis and conclusions are based on the results obtained from this index, thus taking it as valid.
The issues we address are not specific to this article and can be found in many papers using bibliometric data that propose a new index on the basis of common sense intuition and then using it as a black boxed instrument to measure quality, innovation or, now, disruption for creating rankings and formulate policy actions on the basis of the calculated values of the index.
2026-06-05T14:47:57Z
16 pages, 5 figures
Julien Larregue, Yves Gingras

http://arxiv.org/abs/2606.03742v2
A Double Bind: Gendered Funding, Research Topics, and Academic Performance in the Social Sciences
2026-06-05T08:35:03Z
While female representation in social sciences is increasing, systemic gender disparities may persist in research funding and academic performance. Some argue that female scholars now receive equal opportunities, yet evidence suggests that gender imbalances remain, particularly in specific research areas. This study examines 12,945 National Science Foundation (NSF)-funded principal investigators in social sciences from 2000 to 2019 to assess gender disparities in grant allocation, research topics, and post-award academic performance. Findings reveal a dual imbalance. First, despite similar overall funding success rates, female scholars remain underrepresented in high-impact and traditionally male-dominated research topics. Males dominate most funded topics, especially STEM-related ones, while female-led topics align with traditional gender stereotypes. Second, post-award performance patterns suggest that females outperform males in male-dominated fields, whereas males excel in female-dominated ones, undermining any presumed advantage of female scholars in their own research areas. These disparities contribute to the risk of both genders prematurely exiting the science pipeline.
Furthermore, early-career experiences shape these outcomes asymmetrically: postdoctoral experience benefits both genders in female-dominated fields, with stronger effects for males, but disadvantages females in male-dominated fields by reducing their output and citation impact. Longer postdoctoral tenure enhances male researchers' citation impact across all fields but has mixed effects for females depending on field gender composition. These findings underscore the need for policies that address not just overall funding equality, but also gendered disparities across research topics and career trajectories.
2026-06-02T14:56:06Z
Yang Ding, Ning Zhang, Helen Bao, Yu Jin, Jiang Wu, Lianlian Wu, Norman Weitemeier, Meng Huang, Alejandro Otazu Solorzano, Ana Paula Pineda Iriarte, Yunfeng Gao, Lok Man Michelle Tong, Nancy Mukalayi, Pengfei Yin, Shuyu Hu, Yuxuan Xiao, Yarong Song, Jiajing Xu, Chenxu Li, Yi Bu

http://arxiv.org/abs/2603.22327v2
Evaluating AI-based Scientific Knowledge Synthesis with Epidemiological Systematic Reviews
2026-06-04T17:55:51Z
Systematic literature reviews (SLRs) are a demanding and high-stakes form of scientific knowledge synthesis that remains underspecified as an evaluation setting for large language models (LLMs). We introduce AgentSLR, a large-scale evaluation harness comprising an SLR automation workflow and an expert annotated dataset covering 16,248 articles, designed to test LLM capabilities across the stages of SLRs in epidemiology. Reference annotations were derived from peer-reviewed studies on WHO priority pathogens and produced by domain experts. The harness evaluates each review stage as a separate unit with dedicated metrics enabling targeted failure analysis. We evaluated five frontier reasoning models and found that no single model dominated across all tasks, showing sub-task specialisation often hidden by aggregate benchmarks. Structured data extraction is a major bottleneck, with no model exceeding an average field-level F1 of 0.67.
Estimated costs vary substantially, by up to 96 times across evaluated models. Documented failure modes suggest that the evaluated models are not yet reliable enough for unsupervised deployment in epidemiology, where findings can inform public policy.
2026-03-20T17:11:58Z
Shreyansh Padarha, Ryan Othniel Kearns, Tristan Naidoo, Lingyi Yang, Łukasz Borchmann, Piotr Błaszczyk, Christian Morgenstern, Ruth McCabe, Sangeeta Bhatia, Philip H. Torr, Jakob Foerster, Scott A. Hale, Thomas Rawson, Anne Cori, Elizaveta Semenova, Adam Mahdi

http://arxiv.org/abs/2606.06336v1
Bilateral and multilateral international scientific collaboration of EU member states: OpenAlex vs Scopus (2000-2024)
2026-06-04T16:12:27Z
This study examines the evolution of bilateral and multilateral scientific collaboration among EU Member States and between the EU and global partners from 2000 to 2024 using data from OpenAlex and Scopus. The results show that OpenAlex, when restricted to cited articles, yields findings broadly comparable to those obtained from Scopus for assessing country-level research collaboration. Relative Intensity of Collaboration (RIC) values are consistently higher for multilateral than for bilateral partnerships. Increased collaboration intensity during the final years of FP7, the intermediate and later stages of Horizon 2020, and the final years of the study period suggests that EU FP may have strengthened collaboration among participating countries.
With regard to European integration, multilateral collaboration intensity increased between the EU-14 and EU-13, between these groups and EU candidate countries, and within the EU-13. Despite this growth, structural asymmetries persist. Bilateral collaboration among EU-14 countries is concentrated within the group and with EU-13, Brazil, Norway, Switzerland, and the United Kingdom, whereas EU-13 countries collaborate more intensively within the group, with EU candidate countries and Russia. EU-14 countries maintain stronger multilateral collaboration with high-income countries such as Australia, Canada, and the United States than do EU-13 countries. For both groups, collaboration with China remains the weakest. Although multilateral collaboration intensity with Russia has declined, it remained above the expected level for the EU-14 in 2024 and was 2.5 times higher than expected for the EU-13. This persistence may reflect the continued participation of Russian researchers in multilateral projects despite Russia's suspension from Horizon Europe in 2022.
2026-06-04T16:12:27Z
Myroslava Hladchenko

http://arxiv.org/abs/2606.06330v1
Evolution of bilateral and multilateral collaboration of EU-14 countries across disciplines, 2010-2024
2026-06-04T16:04:32Z
This study explores the evolution of bilateral and multilateral research collaboration of nine EU-14 member states, both within Europe and globally, across six disciplines, between 2010 and 2024, using OpenAlex data. Results indicate that bilateral collaboration rates remained relatively stable and predominantly concentrated within EU-14 countries, followed by the USA, the UK, and China. Multilateral collaboration rates increased significantly across all disciplines, with the highest increase observed in medicine and the highest overall rates maintained in physics & astronomy. The same trend across disciplines was observed for the Relative Intensity of Collaboration (RIC).
This reflects the growing importance of large-scale international research consortia in infrastructure-intensive fields that address global scientific challenges. RIC has increased for both bilateral and multilateral collaboration, with stronger growth in multilateral collaboration. Multilateral RIC fell below the expected level most frequently with South Korea, India, and China. Across both collaboration types, increases in collaboration rates were generally associated with increases in RIC.
No substantial changes in collaboration rates or RIC with the UK were observed following Brexit. A decline in multilateral collaboration with Russia in physics and astronomy coincided with its suspension from Horizon Europe in 2022, while the collaboration rate in medicine increased.
2026-06-04T16:04:32Z
Myroslava Hladchenko

http://arxiv.org/abs/2508.20693v2
Leveraging Large Language Models for Generating Research Topic Ontologies: A Multi-Disciplinary Study
2026-06-04T09:12:15Z
Ontologies and taxonomies of research fields are critical for managing and organising scientific knowledge, as they facilitate efficient classification, dissemination and retrieval of information. However, the creation and maintenance of such ontologies are expensive and time-consuming tasks, usually requiring the coordinated effort of multiple domain experts. Consequently, ontologies in this space often exhibit uneven coverage across different disciplines, limited inter-discipline connectivity, and infrequent updating cycles. In this study, we investigate the capability of several large language models to identify semantic relationships among research topics within three academic disciplines: biomedicine, physics, and engineering. The models were evaluated under three distinct conditions: zero-shot prompting, chain-of-thought prompting, and fine-tuning on existing ontologies. Additionally, we assessed the cross-discipline transferability of fine-tuned models by measuring their performance when trained in one discipline and subsequently applied to a different one. To support this analysis, we introduce PEM-Rel-8K, a novel dataset consisting of over 8,000 relationships extracted from the most widely adopted taxonomies in the three disciplines considered in this study: MeSH, PhySH, and IEEE. Our experiments demonstrate that fine-tuning LLMs on PEM-Rel-8K yields excellent performance across all disciplines.
2025-08-28T11:53:45Z
Tanay Aggarwal, Angelo Salatino, Francesco Osborne, Enrico Motta