https://arxiv.org/api/i5UsK5WlBEuK0uT6HH+AgLxeD5g 2026-06-10T02:24:27Z 28818 15 15 http://arxiv.org/abs/2606.08723v2 From Text to Discovery: How Large Language Models Are Accelerating and Complicating Research Across Scientific and Humanistic Disciplines 2026-06-09T13:19:37Z Large Language Models (LLMs) are rapidly reshaping academic research across the natural sciences, social sciences, and humanities, yet the scientific community lacks a comprehensive, cross-disciplinary account of how these tools are being integrated, what they deliver, and where they fall short. This paper addresses that gap by mapping their current state and outlining an agenda for their responsible integration into scientific research. Our analysis reveals a consistent pattern: LLMs meaningfully accelerate research workflows -- from hypothesis generation and literature synthesis to data analysis and scientific writing -- while introducing serious challenges related to hallucination, reproducibility, dataset bias, and model opacity. Beyond technical limitations, we identify ten underexplored challenges, including the erosion of researcher autonomy, AI-driven confirmation bias, authorship ambiguity, and unequal access to these technologies -- systemic risks that demand interdisciplinary governance frameworks, robust validation standards, and expanded explainability research. 2026-06-07T16:38:27Z Saleh Afroogh Yasser Pouresmaeil Yiming Xu Kevin Chen Abhejay Murali Junfeng Jiao http://arxiv.org/abs/2504.20519v5 Large Language Model Chatbot Conversations vs Public Health Materials and Parental HPV Vaccination Intentions: A Randomized Clinical Trial 2026-06-09T13:02:32Z Health care systems are increasingly considering large language model (LLM)-based chatbots for vaccine communication, but evidence that they improve durable, behaviorally relevant outcomes beyond existing health materials is limited. 
This randomized clinical trial tested whether brief, multiturn LLM chatbot interactions increased parental intention to vaccinate children against human papillomavirus (HPV) compared with no intervention and government public health materials, and whether effects persisted. Parents in the US, Canada, and UK were recruited online from March 3 to May 25, 2025, with follow-up at 15 and 45 days. Eligible participants were adults with at least one HPV vaccine-eligible child who was unvaccinated or whose vaccination status was unknown. Participants were randomized to no-message control, country-matched government materials with at least 3 minutes of exposure, or a 3-minute GPT-4o chatbot interaction using either a default persuasive style or a shorter conversational style. The primary outcome was self-reported likelihood of vaccinating the child against HPV within 12 months, measured immediately after intervention on a 0-100 scale. Follow-up outcomes included vaccination intent and self-reported vaccination at 15 and 45 days. In total, 1297 participants were randomized (mean age 42.84 years; 72.1% female). Compared with no intervention, public health materials increased immediate vaccination intent (Cohen d = 0.53; 95% CI, 0.36-0.70), as did the default chatbot (d = 0.48; 95% CI, 0.30-0.65) and conversational chatbot (d = 0.33; 95% CI, 0.17-0.49). At 45 days, neither chatbot increased intent relative to controls, whereas public health materials maintained modest effects. No intervention increased self-reported vaccination uptake. Findings suggest well-designed public health materials may match or exceed short LLM chatbot conversations for HPV vaccine promotion. 2025-04-29T07:59:46Z JAMA Network Open 2026 Neil K. R. Sehgal Sunny Rai Manuel Tonneau Anish K. 
Agarwal Joseph Cappella Melanie Kornides Lyle Ungar Alison Buttenheim Sharath Chandra Guntuku 10.1001/jamanetworkopen.2026.16822 http://arxiv.org/abs/2606.08534v2 A Taxonomy of Real-World Asset Tokenization for Blockchain-Based Financial Infrastructure 2026-06-09T12:16:49Z Real-world asset (RWA) tokenization has emerged as a prominent application of blockchain technology, enabling off-chain financial and non-financial assets to be represented through blockchain-based instruments. However, deployed RWA systems remain difficult to compare because legal claims, custody arrangements, token mechanics, verification processes, and on-chain integrations are often described separately. This paper develops a systems-level taxonomy of RWA tokenization to classify how off-chain assets are legally, economically, and technically represented on-chain. Following an iterative taxonomy-development method, we organize twenty-three dimensions into five components: governance, asset structure, token properties, distributed ledger technology, and economy. We apply the taxonomy to twenty major RWA systems selected by market capitalization and compare their design choices across asset classes and implementation models. The classification shows that current RWA tokenization is predominantly implemented through hybrid architectures: blockchain tokens support representation, transfer control, redemption workflows, pricing, and composability, while core legal guarantees remain anchored in off-chain legal wrappers, custodial arrangements, compliance processes, and verification mechanisms. The analysis also reveals recurring documentation gaps concerning voting rights, dispute forums, burn mechanics, supply constraints, and reserve verification. Overall, the taxonomy provides a structured basis for comparing RWA systems, identifying design patterns and limitations, and supporting future research on blockchain-based financial infrastructure. 
2026-06-07T09:30:11Z Giorgio Vella Luca Pennella Mark C. Ballandies http://arxiv.org/abs/2606.10736v1 Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs 2026-06-09T11:47:04Z Large online courses generate thousands of student questions directed at conversational AI teaching assistants, yet these interaction logs remain largely untapped as diagnostic signals. We present a pipeline that maps student questions from a conversational AI teaching assistant to curriculum topics using a few-shot text classifier, grounded in a GPT-4-extracted prerequisite knowledge graph of course concepts. Evaluated on 1,340 question events from 164 students in a graduate-level AI course, our classifier achieves 80.0% accuracy across 43 labels (42 curriculum topics plus an "unknown" abstention class). Topic-level question volume correlates significantly with student self-reported difficulty from an independent mid-semester survey (rho = 0.491, p = 0.008, n = 28 topics), providing convergent evidence that the classified question stream reflects genuine topic difficulty. These results demonstrate that conversational AI interaction logs, mapped onto curriculum structure, carry actionable signals about topic-level knowledge gaps and provide instructors with a curriculum-grounded view of which topics warrant attention. 2026-06-09T11:47:04Z Accepted as a short paper at the 10th CSEDM Workshop, co-located with the 18th International Conference on Educational Data Mining (EDM 2026). 7 pages, 2 figures, 2 tables Youssef Medhat Junsoo Park Ploy Thajchayapong Ashok K. Goel http://arxiv.org/abs/2606.10726v1 Beyond Journals: Rethinking Research Evaluation in Hungarian Computer Science 2026-06-09T11:33:51Z This study examines the role of top-tier conference publications in Hungarian computer science research. 
We show that the national scientometric practice, which is currently journal-oriented, diverges from international norms, creating incentive distortions in researcher evaluation. By linking multiple databases (iCore, DBLP, MTMT, MTA-ATT), we mapped Hungarian-affiliated CORE A* and A conference papers, their temporal and thematic distribution, and author trajectories. Our results indicate that, in theoretical fields, publishing at international conferences became common earlier than in applied fields. At the same time, in applied fields, successful researchers are more likely to continue their careers in foreign institutions or in industry positions. Overall, a substantial share of the already established, internationally most successful researchers are now affiliated with institutions abroad. We recommend recognizing CORE A* papers as equivalent to D1 and CORE A papers as equivalent to Q1 journals in national evaluation systems. 2026-06-09T11:33:51Z A Hungarian version of this article has been accepted for publication in Magyar Tudomány, the journal of the Hungarian Academy of Sciences János Tapolcai Márk Jelasity Lajos Rónyai András Benczúr Tibor Gyimóthy Csaba Benedek http://arxiv.org/abs/2606.10711v1 The Agentic Web Requires New Normative Infrastructure 2026-06-09T11:15:48Z The agentic web, in which users interact with the internet largely through agents acting on their behalf, is now technically feasible. However, many of the consumer and social benefits that could be realized by online AI agents acting scrupulously in their principals' interest are currently obstructed by outdated laws, terms of service, and other less formal practices which allow online platforms to block and degrade agent access, often in secret. No distinction is currently drawn between "malicious bots" and AI agents acting with the express delegated authority of a user. 
For the agentic web to realize its promise, it needs not only the technical infrastructure of protocols and interfaces, but the normative infrastructure of a broadly-accepted and socially-beneficial set of laws, norms and practices governing agentic access to online properties. Building that normative infrastructure requires a society-wide conversation. This paper aims to help precipitate that conversation, to identify normative principles that can guide it, and to advocate for policies that enable users' appropriately delegated agents to act online on their behalf, with as few curbs on their doing so as is reasonable given the other legitimate interests at stake. 2026-06-09T11:15:48Z 1 figure Cameron Pattison Matthew Boulos Noam Kolt Changbai Li Tiziano Piccardi Seth Lazar http://arxiv.org/abs/2606.10660v1 Accounting for AI Inference in Corporate GHG Inventories: A Four-Tier Methodology for Scope 3 Category 1 Reporting 2026-06-09T10:08:36Z AI inference services -- API subscriptions, enterprise chat tools, and SaaS products with embedded AI features -- fall unambiguously within Scope 3 Category 1 under the Corporate Sustainability Reporting Directive (CSRD), which requires disclosure for fiscal years starting January 2024. Yet no standardised methodology exists for including them in corporate GHG inventories. Current practice either omits the category entirely or applies a generic economic input-output (EEIO) factor calibrated to the ICT sector as a whole, overestimating AI inference emissions by 10-40x relative to physically derived alternatives. We propose a four-tier framework that matches estimation precision to the data organisations can realistically obtain, progressing from direct token-based physical estimation -- using GPU energy benchmarks and regional grid carbon intensities -- down to a spend-based EEIO fallback for services where no usage data exists. 
Emission factors are derived from peer-reviewed GPU energy benchmarks (ML.ENERGY Leaderboard v3), confirmed grid carbon intensities (EPA eGRID 2023; Ember 2023), and published water use effectiveness data (Li et al., 2025). Applied to a 200-person European firm, the framework yields a total below 1 tCO2e, illustrating that the compliance challenge is methodological rather than magnitude-driven. We further document a water-carbon trade-off that current ESG tools do not surface: Sweden's hydro-dominated grid delivers the lowest carbon intensity in our dataset but the highest water footprint, with direct implications for data centre location strategy. 2026-06-09T10:08:36Z Preprint. Data repository: https://doi.org/10.5281/zenodo.20443586. 18 pages, 3 figures, 6 tables Guillermo Llopis SOMA AI, Barcelona http://arxiv.org/abs/2507.01062v4 Quantifying Perception-Based Student Success with Generative AI: An Exploratory Monte Carlo Simulation 2026-06-09T08:53:09Z Generative artificial intelligence (GenAI) tools such as ChatGPT have attracted growing attention in higher education, particularly in relation to how students perceive their usefulness, usability, and educational value. This study develops an exploratory Monte Carlo simulation framework for quantifying perception-based student success in the context of GenAI use. A PRISMA-informed structured literature search in Scopus identified nineteen empirical studies published between 2023 and 2025, of which six reported item-level means and standard deviations suitable for probabilistic modelling. One coherent 10-item, 5-point Likert-scale usability-oriented instrument was selected as a canonical proof-of-concept dataset and used to parameterise an inverse-variance-weighted Monte Carlo simulation generating 10,000 synthetic observations. 
The results show that the weighting structure substantially influences the simulated outcome, with System Efficiency and Learning Burden receiving the largest inverse-variance weight and therefore the strongest influence on the composite score. The study offers a transparent, reproducible, and privacy-preserving proof-of-concept framework linking structured literature search, item-level summary statistics, and probabilistic modelling. 2025-06-30T09:50:38Z Published in Education Sciences. This article is an extended and substantially revised version of a conference paper presented at the Melbourne Institute of Technology ICETE Conference, Sydney, NSW, Australia, 9-10 February 2026. The earlier conference version is available at DOI 10.25397/ppny-f488 Education Sciences 2026, 16, 832 Seyma Yaman Kayadibi 10.3390/educsci16060832 http://arxiv.org/abs/2606.10575v1 Platform Sorting Drives Ideological Fragmentation in the Social Media Ecosystem 2026-06-09T08:41:17Z Ideological asymmetries in online political communication are often studied as localized phenomena emerging within communities. Here, we show that fragmentation instead operates at the level of entire platforms, consistent with a process of platform sorting in which users increasingly align with ideologically congruent environments. We analyze political information dynamics across Bluesky, Facebook, Reddit, Truth Social, Twitter/X, and YouTube during the 2020 and 2024 US presidential elections, combining measures of content sharing, engagement allocation, and user-level ideological orientation. Across platforms, ideological fragmentation emerges consistently and persists over time. Platforms exhibit distinct ideological profiles that persist across the two election cycles, ranging from strongly left-leaning to strongly right-leaning environments. 
Longitudinal analyses further reveal limited ideological variability among persistent user cohorts, indicating that apparent changes within single platforms reflect ecosystem-level sorting rather than convergence toward neutrality. Taken together, our results show that the dynamics of platform sorting are not a transient reaction to political events or moderation interventions, but a persistent structural feature of the social media ecosystem. 2026-06-09T08:41:17Z Edoardo Di Martino Alessandro Galeazzi Matteo Cinelli Michele Starnini Walter Quattrociocchi http://arxiv.org/abs/2606.10544v1 From Stacks to Circuits: A Regenerative Socio-Technical Roadmap for AI Infrastructure within Planetary Boundaries 2026-06-09T08:11:25Z Current scaling trajectories for Generative AI, typified by linear supply-side "stacks," prioritize performance density while externalizing significant thermodynamic and material costs. As the "Twin Transition" of green and digital transformation accelerates, the industry faces technology gaps - including Scope 3 emissions and e-waste recycling - that impede sustainable scaling and lead to social tensions. This study proposes a Regenerative Socio-Technical roadmap that repurposes the Sustainable Production and Consumption system map to reframe artificial intelligence infrastructure as a system-of-systems governed ultimately by planetary limits. By integrating the Institute of Electrical and Electronics Engineers International Roadmap for Devices and Systems (IEEE IRDS) sustainability considerations for semiconductor facilities, the study proposes a metabolic circuit framework that centers "Values and Needs" within production and consumption relationship loops. This study identifies critical gaps in current Nvidia-centric roadmaps and proposes a competing reference architecture. 
It demonstrates how a spontaneous order of resource parsimony and planetary accountability can provide an actionable pathway for regulatory compliance and industrial resilience in the digital circular economy. 2026-06-09T08:11:25Z This document is a working paper and reflects the state of research as of May 2026. Comments are welcome and should be directed to the corresponding author at h.liao@ieee.org. This work is accepted for presentation at the 32nd IEEE ICE/ITMC Conference, Porto, Portugal 2026 IEEE International Conference on Engineering, Technology, and Innovation (ICE/ITMC), forthcoming 2026 Han-Teng Liao Karen Ang http://arxiv.org/abs/2407.20242v5 BadRobot: Jailbreaking Embodied LLM Agents in the Physical World 2026-06-09T05:19:18Z Embodied AI refers to systems in which AI is integrated into physical entities. Large language models (LLMs), which exhibit powerful language understanding abilities, have been extensively employed in embodied AI to facilitate sophisticated task planning. However, a critical safety issue remains overlooked: could these embodied LLMs perpetrate harmful behaviors? In response, we introduce BadRobot, a novel attack paradigm aiming to make embodied LLMs violate safety and ethical constraints through typical voice-based user-system interactions. Specifically, three vulnerabilities are exploited to achieve this type of attack: (i) manipulation of LLMs within robotic systems, (ii) misalignment between linguistic outputs and physical actions, and (iii) unintentional hazardous behaviors caused by flaws in world knowledge. Furthermore, we construct a benchmark of various malicious physical action queries to evaluate BadRobot's attack performance. Based on this benchmark, extensive experiments against existing prominent embodied LLM frameworks (e.g., Voxposer, Code as Policies, and ProgPrompt) demonstrate the effectiveness of BadRobot. Our code is available at https://github.com/Rookie143/BadRobot. 
2024-07-16T13:13:16Z Accepted to ICLR 2025. Please cite the conference version. Project page: https://Embodied-LLMs-Safety.github.io International Conference on Learning Representations (ICLR) 2025 Hangtao Zhang Chenyu Zhu Xianlong Wang Ziqi Zhou Changgan Yin Minghui Li Lulu Xue Yichen Wang Shengshan Hu Aishan Liu Peijin Guo Leo Yu Zhang http://arxiv.org/abs/2606.08251v2 Contemporary AI lacks the imagination to diverge or negate in science 2026-06-09T03:31:39Z Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge ideas that large language models (LLMs) generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Three patterns emerge. First, non-reasoning LLMs collapse into a narrow "hivemind" of similar ideas; reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses -- a move humans make more freely. Second, scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. Senior social scientists are the harshest critics, and their skepticism is well-earned: LLMs falter most in pluralistic fields like the social sciences that demand context-aware interpretation and evolving theories. 
Third, automated evaluators on which the community currently relies -- LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models -- agree only weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains. A Qwen3-14B reward model we post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers. For all the hype, today's scientific AI still represents a collaborator whose imagination, outputs and judgment benefit from human grounding. 2026-06-06T16:39:28Z Honglin Bao Siyang Wu Xiao Liu Sida Li Shiyun Cao James A. Evans http://arxiv.org/abs/2606.10330v1 The Power of Altruism in Sticker Economics: Generosity Minimizes Collective Costs and Overprotective Norms Fuel Inefficiency 2026-06-09T02:24:59Z Collecting the FIFA World Cup sticker album presents a classic public-goods and collective-action dilemma, in which completing a collection on one's own is highly inefficient. To evaluate how localized community norms shape collective efficiency, we use agent-based modeling and Monte Carlo simulations, parameterized with empirical field observations from exchange meetups in Natal, Brazil. Reflecting the tournament's recent expansion, the Panini 2026 album features 980 individual stickers, including 68 metallic specials. We contrast a standard baseline economy (1:2 special-to-normal exchange ratio) with an overprotective, strict strategy (exclusive special-for-special trading) and an altruistic, generous strategy (in which advanced players surrender needed duplicates to assist peers). Our findings reveal that overprotective rules trap liquidity and drive network-wide inefficiency. The strict strategy increases median completion costs by 10 packs and severely penalizes the least fortunate 5% of collectors, adding 20 packs in large cities and 30 in small communities. 
Conversely, widespread generosity optimizes network liquidity and dramatically compresses the long tail of bad luck. Introducing the generous strategy reduces required purchases for the 5th percentile by 90 packs in large-scale configurations and 130 packs in smaller clusters. Furthermore, widespread altruism triggers a strong functional coupling that effectively synchronizes completion rates across the network. This study demonstrates that while rigid, protective norms degrade collective welfare, generosity successfully mitigates pack-draw variance, transforming an expensive, isolated hobby into a resilient, highly efficient public good. 2026-06-09T02:24:59Z Luana Ferraz Alvarenga Caetano Alvarenga Costa César Rennó-Costa http://arxiv.org/abs/2605.03217v3 Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability 2026-06-09T02:10:17Z Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude exhibits sharp suppression consistent with identity-based safety training. We then verify these behavioral patterns mechanistically. 
We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation. 2026-05-04T23:12:32Z Yash Aggarwal Atmika Gorti Vinija Jain Aman Chadha Krishnaprasad Thirunarayan Manas Gaur http://arxiv.org/abs/2604.13776v2 Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking 2026-06-09T01:55:56Z Watermarking is becoming the default mechanism for AI content authentication, with governance policies and frameworks referencing it as infrastructure for content provenance. Yet across text, image, and audio modalities, watermark signal strength, detectability, and robustness depend on statistical properties of the content itself, properties that vary systematically across languages, cultural visual traditions, and demographic groups. We examine how this content dependence creates modality-specific pathways to bias. Reviewing the major watermarking benchmarks across modalities, we find that, with one exception, none report performance across languages, cultural content types, or population groups. 
To address this, we propose three concrete evaluation dimensions for pluralistic watermark benchmarking: cross-lingual detection parity, culturally diverse content coverage, and demographic disaggregation of detection metrics. We argue that watermarking is part of the pluralistic alignment pipeline and should be held to the same evaluation standards. We connect this to governance frameworks currently mandating watermarking deployment without requiring fairness evaluation. Our position is that evaluation must precede deployment, and that the same bias auditing requirements applied to AI models should extend to the verification layer. 2026-04-15T12:06:56Z 7 pages. Accepted at the Multimodal Alignment for a Pluralistic Society (MAPS) Workshop, CVPR 2026 Alexander Nemecek Osama Zafar Yuqiao Xu Wenbiao Li Erman Ayday