https://arxiv.org/api/i5UsK5WlBEuK0uT6HH+AgLxeD5g
2026-06-10T02:24:27Z

http://arxiv.org/abs/2606.08723v2
From Text to Discovery: How Large Language Models Are Accelerating and Complicating Research Across Scientific and Humanistic Disciplines
2026-06-09T13:19:37Z
Large Language Models (LLMs) are rapidly reshaping academic research across the natural sciences, social sciences, and humanities, yet the scientific community lacks a comprehensive, cross-disciplinary account of how these tools are being integrated, what they deliver, and where they fall short. This paper addresses that gap by mapping their current state and outlining an agenda for their responsible integration into scientific research. Our analysis reveals a consistent pattern: LLMs meaningfully accelerate research workflows -- from hypothesis generation and literature synthesis to data analysis and scientific writing -- while introducing serious challenges related to hallucination, reproducibility, dataset bias, and model opacity. Beyond technical limitations, we identify ten underexplored challenges, including the erosion of researcher autonomy, AI-driven confirmation bias, authorship ambiguity, and unequal access to these technologies -- systemic risks that demand interdisciplinary governance frameworks, robust validation standards, and expanded explainability research.
2026-06-07T16:38:27Z
Saleh Afroogh, Yasser Pouresmaeil, Yiming Xu, Kevin Chen, Abhejay Murali, Junfeng Jiao

http://arxiv.org/abs/2504.20519v5
Large Language Model Chatbot Conversations vs Public Health Materials and Parental HPV Vaccination Intentions: A Randomized Clinical Trial
2026-06-09T13:02:32Z
Health care systems are increasingly considering large language model (LLM)-based chatbots for vaccine communication, but evidence that they improve durable, behaviorally relevant outcomes beyond existing health materials is limited.
This randomized clinical trial tested whether brief, multiturn LLM chatbot interactions increased parental intention to vaccinate children against human papillomavirus (HPV) compared with no intervention and government public health materials, and whether effects persisted. Parents in the US, Canada, and UK were recruited online from March 3 to May 25, 2025, with follow-up at 15 and 45 days. Eligible participants were adults with at least one HPV vaccine-eligible child who was unvaccinated or whose vaccination status was unknown. Participants were randomized to no-message control, country-matched government materials with at least 3 minutes of exposure, or a 3-minute GPT-4o chatbot interaction using either a default persuasive style or a shorter conversational style. The primary outcome was self-reported likelihood of vaccinating the child against HPV within 12 months, measured immediately after intervention on a 0-100 scale. Follow-up outcomes included vaccination intent and self-reported vaccination at 15 and 45 days. In total, 1297 participants were randomized (mean age 42.84 years; 72.1% female). Compared with no intervention, public health materials increased immediate vaccination intent (Cohen d = 0.53; 95% CI, 0.36-0.70), as did the default chatbot (d = 0.48; 95% CI, 0.30-0.65) and conversational chatbot (d = 0.33; 95% CI, 0.17-0.49). At 45 days, neither chatbot increased intent relative to controls, whereas public health materials maintained modest effects. No intervention increased self-reported vaccination uptake. Findings suggest well-designed public health materials may match or exceed short LLM chatbot conversations for HPV vaccine promotion.
2025-04-29T07:59:46Z
JAMA Network Open 2026
Neil K. R. Sehgal, Sunny Rai, Manuel Tonneau, Anish K.
Agarwal, Joseph Cappella, Melanie Kornides, Lyle Ungar, Alison Buttenheim, Sharath Chandra Guntuku
10.1001/jamanetworkopen.2026.16822

http://arxiv.org/abs/2606.08534v2
A Taxonomy of Real-World Asset Tokenization for Blockchain-Based Financial Infrastructure
2026-06-09T12:16:49Z
Real-world asset (RWA) tokenization has emerged as a prominent application of blockchain technology, enabling off-chain financial and non-financial assets to be represented through blockchain-based instruments. However, deployed RWA systems remain difficult to compare because legal claims, custody arrangements, token mechanics, verification processes, and on-chain integrations are often described separately. This paper develops a systems-level taxonomy of RWA tokenization to classify how off-chain assets are legally, economically, and technically represented on-chain. Following an iterative taxonomy-development method, we organize twenty-three dimensions into five components: governance, asset structure, token properties, distributed ledger technology, and economy. We apply the taxonomy to twenty major RWA systems selected by market capitalization and compare their design choices across asset classes and implementation models. The classification shows that current RWA tokenization is predominantly implemented through hybrid architectures: blockchain tokens support representation, transfer control, redemption workflows, pricing, and composability, while core legal guarantees remain anchored in off-chain legal wrappers, custodial arrangements, compliance processes, and verification mechanisms. The analysis also reveals recurring documentation gaps concerning voting rights, dispute forums, burn mechanics, supply constraints, and reserve verification. Overall, the taxonomy provides a structured basis for comparing RWA systems, identifying design patterns and limitations, and supporting future research on blockchain-based financial infrastructure.
2026-06-07T09:30:11Z
Giorgio Vella, Luca Pennella, Mark C.
Ballandies

http://arxiv.org/abs/2606.10736v1
Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs
2026-06-09T11:47:04Z
Large online courses generate thousands of student questions directed at conversational AI teaching assistants, yet these interaction logs remain largely untapped as diagnostic signals. We present a pipeline that maps student questions from a conversational AI teaching assistant to curriculum topics using a few-shot text classifier, grounded in a GPT-4-extracted prerequisite knowledge graph of course concepts. Evaluated on 1,340 question events from 164 students in a graduate-level AI course, our classifier achieves 80.0% accuracy across 43 labels (42 curriculum topics plus an "unknown" abstention class). Topic-level question volume correlates significantly with student self-reported difficulty from an independent mid-semester survey (rho = 0.491, p = 0.008, n = 28 topics), providing convergent evidence that the classified question stream reflects genuine topic difficulty. These results demonstrate that conversational AI interaction logs, mapped onto curriculum structure, carry actionable signals about topic-level knowledge gaps and provide instructors with a curriculum-grounded view of which topics warrant attention.
2026-06-09T11:47:04Z
Accepted as a short paper at the 10th CSEDM Workshop, co-located with the 18th International Conference on Educational Data Mining (EDM 2026). 7 pages, 2 figures, 2 tables
Youssef Medhat, Junsoo Park, Ploy Thajchayapong, Ashok K. Goel

http://arxiv.org/abs/2606.10726v1
Beyond Journals: Rethinking Research Evaluation in Hungarian Computer Science
2026-06-09T11:33:51Z
This study examines the role of top-tier conference publications in Hungarian computer science research. We show that the national scientometric practice, which is currently journal-oriented, diverges from international norms, creating incentive distortions in researcher evaluation.
By linking multiple databases (iCore, DBLP, MTMT, MTA-ATT), we mapped Hungarian-affiliated CORE A* and A conference papers, their temporal and thematic distribution, and author trajectories. Our results indicate that, in theoretical fields, publishing at international conferences became common earlier than in applied fields. At the same time, in applied fields, successful researchers are more likely to continue their careers in foreign institutions or in industry positions. Overall, a substantial share of the already established, internationally most successful researchers are now affiliated with institutions abroad. We recommend recognizing CORE A* papers as equivalent to D1 and CORE A papers as equivalent to Q1 journals in national evaluation systems.
2026-06-09T11:33:51Z
A Hungarian version of this article has been accepted for publication in Magyar Tudomány, the journal of the Hungarian Academy of Sciences
János Tapolcai, Márk Jelasity, Lajos Rónyai, András Benczúr, Tibor Gyimóthy, Csaba Benedek

http://arxiv.org/abs/2606.10711v1
The Agentic Web Requires New Normative Infrastructure
2026-06-09T11:15:48Z
The agentic web, in which users interact with the internet largely through agents acting on their behalf, is now technically feasible. However, many of the consumer and social benefits that could be realized by online AI agents acting scrupulously in their principals' interest are currently obstructed by outdated laws, terms of service, and other less formal practices which allow online platforms to block and degrade agent access, often in secret. No distinction is currently drawn between "malicious bots" and AI agents acting with the express delegated authority of a user. For the agentic web to realize its promise, it needs not only the technical infrastructure of protocols and interfaces, but the normative infrastructure of a broadly-accepted and socially-beneficial set of laws, norms and practices governing agentic access to online properties.
Building that normative infrastructure requires a society-wide conversation. This paper aims to help precipitate that conversation, to identify normative principles that can guide it, and to advocate for policies that enable users' appropriately delegated agents to act online on their behalf, with as few curbs on their doing so as is reasonable given the other legitimate interests at stake.
2026-06-09T11:15:48Z
1 figure
Cameron Pattison, Matthew Boulos, Noam Kolt, Changbai Li, Tiziano Piccardi, Seth Lazar

http://arxiv.org/abs/2606.10660v1
Accounting for AI Inference in Corporate GHG Inventories: A Four-Tier Methodology for Scope 3 Category 1 Reporting
2026-06-09T10:08:36Z
AI inference services -- API subscriptions, enterprise chat tools, and SaaS products with embedded AI features -- fall unambiguously within Scope 3 Category 1 under the Corporate Sustainability Reporting Directive (CSRD), which requires disclosure for fiscal years starting January 2024. Yet no standardised methodology exists for including them in corporate GHG inventories. Current practice either omits the category entirely or applies a generic economic input-output (EEIO) factor calibrated to the ICT sector as a whole, overestimating AI inference emissions by 10-40x relative to physically derived alternatives.
We propose a four-tier framework that matches estimation precision to the data organisations can realistically obtain, progressing from direct token-based physical estimation -- using GPU energy benchmarks and regional grid carbon intensities -- down to a spend-based EEIO fallback for services where no usage data exists. Emission factors are derived from peer-reviewed GPU energy benchmarks (ML.ENERGY Leaderboard v3), confirmed grid carbon intensities (EPA eGRID 2023; Ember 2023), and published water use effectiveness data (Li et al., 2025). Applied to a 200-person European firm, the framework yields a total below 1 tCO2e, illustrating that the compliance challenge is methodological rather than magnitude-driven. We further document a water-carbon trade-off that current ESG tools do not surface: Sweden's hydro-dominated grid delivers the lowest carbon intensity in our dataset but the highest water footprint, with direct implications for data centre location strategy.
2026-06-09T10:08:36Z
Preprint. Data repository: https://doi.org/10.5281/zenodo.20443586. 18 pages, 3 figures, 6 tables
Guillermo Llopis (SOMA AI, Barcelona)

http://arxiv.org/abs/2507.01062v4
Quantifying Perception-Based Student Success with Generative AI: An Exploratory Monte Carlo Simulation
2026-06-09T08:53:09Z
Generative artificial intelligence (GenAI) tools such as ChatGPT have attracted growing attention in higher education, particularly in relation to how students perceive their usefulness, usability, and educational value. This study develops an exploratory Monte Carlo simulation framework for quantifying perception-based student success in the context of GenAI use. A PRISMA-informed structured literature search in Scopus identified nineteen empirical studies published between 2023 and 2025, of which six reported item-level means and standard deviations suitable for probabilistic modelling.
One coherent 10-item, 5-point Likert-scale usability-oriented instrument was selected as a canonical proof-of-concept dataset and used to parameterise an inverse-variance-weighted Monte Carlo simulation generating 10,000 synthetic observations. The results show that the weighting structure substantially influences the simulated outcome, with System Efficiency and Learning Burden receiving the largest inverse-variance weight and therefore the strongest influence on the composite score. The study offers a transparent, reproducible, and privacy-preserving proof-of-concept framework linking structured literature search, item-level summary statistics, and probabilistic modelling.
2025-06-30T09:50:38Z
Published in Education Sciences. This article is an extended and substantially revised version of a conference paper presented at the Melbourne Institute of Technology ICETE Conference, Sydney, NSW, Australia, 9-10 February 2026. The earlier conference version is available at DOI 10.25397/ppny-f488
Education Sciences 2026, 16, 832
Seyma Yaman Kayadibi
10.3390/educsci16060832

http://arxiv.org/abs/2606.10575v1
Platform Sorting Drives Ideological Fragmentation in the Social Media Ecosystem
2026-06-09T08:41:17Z
Ideological asymmetries in online political communication are often studied as localized phenomena emerging within communities. Here, we show that fragmentation instead operates at the level of entire platforms, consistent with a process of platform sorting in which users increasingly align with ideologically congruent environments. We analyze political information dynamics across Bluesky, Facebook, Reddit, Truth Social, Twitter/X, and YouTube during the 2020 and 2024 US presidential elections, combining measures of content sharing, engagement allocation, and user-level ideological orientation. Across platforms, ideological fragmentation emerges consistently and persists over time.
Platforms exhibit distinct ideological profiles that persist across the two election cycles, ranging from strongly left-leaning to strongly right-leaning environments. Longitudinal analyses further reveal limited ideological variability among persistent user cohorts, indicating that apparent changes within single platforms reflect ecosystem-level sorting rather than convergence toward neutrality. Taken together, our results show that platform sorting is not a transient reaction to political events or moderation interventions, but a persistent structural feature of the social media ecosystem.
2026-06-09T08:41:17Z
Edoardo Di Martino, Alessandro Galeazzi, Matteo Cinelli, Michele Starnini, Walter Quattrociocchi

http://arxiv.org/abs/2606.10544v1
From Stacks to Circuits: A Regenerative Socio-Technical Roadmap for AI Infrastructure within Planetary Boundaries
2026-06-09T08:11:25Z
Current scaling trajectories for Generative AI, typified by linear supply-side "stacks," prioritize performance density while externalizing significant thermodynamic and material costs. As the "Twin Transition" of green and digital transformation accelerates, the industry faces technology gaps -- including Scope 3 emissions and e-waste recycling -- that impede sustainable scaling and lead to social tensions. This study proposes a Regenerative Socio-Technical roadmap that repurposes the Sustainable Production and Consumption system map to reframe artificial intelligence infrastructure as a system-of-systems governed ultimately by planetary limits. By integrating the Institute of Electrical and Electronics Engineers International Roadmap for Devices and Systems (IEEE IRDS) sustainability considerations for semiconductor facilities, the study proposes a metabolic circuit framework that centers "Values and Needs" within production and consumption relationship loops. This study identifies critical gaps in current Nvidia-centric roadmaps and proposes a competing reference architecture.
It demonstrates how a spontaneous order of resource parsimony and planetary accountability can provide an actionable pathway for regulatory compliance and industrial resilience in the digital circular economy.
2026-06-09T08:11:25Z
This document is a working paper and reflects the state of research as of May 2026. Comments are welcome and should be directed to the corresponding author at h.liao@ieee.org. This work is accepted for presentation at the 32nd IEEE ICE/ITMC Conference, Porto, Portugal
2026 IEEE International Conference on Engineering, Technology, and Innovation (ICE/ITMC), forthcoming 2026
Han-Teng Liao, Karen Ang

http://arxiv.org/abs/2407.20242v5
BadRobot: Jailbreaking Embodied LLM Agents in the Physical World
2026-06-09T05:19:18Z
Embodied AI refers to systems in which AI is integrated into physical entities. Large language models (LLMs), which exhibit powerful language understanding abilities, have been extensively employed in embodied AI to facilitate sophisticated task planning. However, a critical safety issue remains overlooked: could these embodied LLMs perpetrate harmful behaviors? In response, we introduce BadRobot, a novel attack paradigm aiming to make embodied LLMs violate safety and ethical constraints through typical voice-based user-system interactions. Specifically, three vulnerabilities are exploited to achieve this type of attack: (i) manipulation of LLMs within robotic systems, (ii) misalignment between linguistic outputs and physical actions, and (iii) unintentional hazardous behaviors caused by flaws in world knowledge. Furthermore, we construct a benchmark of various malicious physical action queries to evaluate BadRobot's attack performance. Based on this benchmark, extensive experiments against existing prominent embodied LLM frameworks (e.g., Voxposer, Code as Policies, and ProgPrompt) demonstrate the effectiveness of BadRobot. Our code is available at https://github.com/Rookie143/BadRobot.
2024-07-16T13:13:16Z
Accepted to ICLR 2025.
Please cite the conference version. Project page: https://Embodied-LLMs-Safety.github.io
International Conference on Learning Representations (ICLR) 2025
Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, Peijin Guo, Leo Yu Zhang

http://arxiv.org/abs/2606.08251v2
Contemporary AI lacks the imagination to diverge or negate in science
2026-06-09T03:31:39Z
Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge ideas that large language models (LLMs) generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Three patterns emerge. First, non-reasoning LLMs collapse into a narrow "hivemind" of similar ideas; reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses -- a move humans make more freely. Second, scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. Senior social scientists are the harshest critics, and their skepticism is well-earned: LLMs falter most in pluralistic fields like the social sciences that demand context-aware interpretation and evolving theories. Third, automated evaluators on which the community currently relies -- LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models -- agree only weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains.
A Qwen3-14B reward model we post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers. For all the hype, today's scientific AI still represents a collaborator whose imagination, outputs and judgment benefit from human grounding.
2026-06-06T16:39:28Z
Honglin Bao, Siyang Wu, Xiao Liu, Sida Li, Shiyun Cao, James A. Evans

http://arxiv.org/abs/2606.10330v1
The Power of Altruism in Sticker Economics: Generosity Minimizes Collective Costs and Overprotective Norms Fuel Inefficiency
2026-06-09T02:24:59Z
Collecting the FIFA World Cup sticker album presents a classic public-goods and collective-action dilemma, in which completing a collection on one's own is highly inefficient. To evaluate how localized community norms shape collective efficiency, we use agent-based modeling and Monte Carlo simulations, parameterized with empirical field observations from exchange meetups in Natal, Brazil. Reflecting the tournament's recent expansion, the Panini 2026 album features 980 individual stickers, including 68 metallic specials. We contrast a standard baseline economy (1:2 special-to-normal exchange ratio) with an overprotective, strict strategy (exclusive special-for-special trading) and an altruistic, generous strategy (in which advanced players surrender needed duplicates to assist peers). Our findings reveal that overprotective rules trap liquidity and drive network-wide inefficiency. The strict strategy increases median completion costs by 10 packs and severely penalizes the least fortunate 5% of collectors, adding 20 packs in large cities and 30 in small communities. Conversely, widespread generosity optimizes network liquidity and dramatically compresses the long tail of bad luck. Introducing the generous strategy reduces required purchases for the 5th percentile by 90 packs in large-scale configurations and 130 packs in smaller clusters.
Furthermore, widespread altruism triggers a strong functional coupling that effectively synchronizes completion rates across the network. This study demonstrates that while rigid, protective norms degrade collective welfare, generosity successfully mitigates pack-draw variance, transforming an expensive, isolated hobby into a resilient, highly efficient public good.
2026-06-09T02:24:59Z
Luana Ferraz Alvarenga, Caetano Alvarenga Costa, César Rennó-Costa

http://arxiv.org/abs/2605.03217v3
Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
2026-06-09T02:10:17Z
Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude exhibits sharp suppression consistent with identity-based safety training. We then verify these behavioral patterns mechanistically. We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants.
Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
2026-05-04T23:12:32Z
Yash Aggarwal, Atmika Gorti, Vinija Jain, Aman Chadha, Krishnaprasad Thirunarayan, Manas Gaur

http://arxiv.org/abs/2604.13776v2
Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking
2026-06-09T01:55:56Z
Watermarking is becoming the default mechanism for AI content authentication, with governance policies and frameworks referencing it as infrastructure for content provenance. Yet across text, image, and audio modalities, watermark signal strength, detectability, and robustness depend on statistical properties of the content itself, properties that vary systematically across languages, cultural visual traditions, and demographic groups. We examine how this content dependence creates modality-specific pathways to bias. Reviewing the major watermarking benchmarks across modalities, we find that, with one exception, none report performance across languages, cultural content types, or population groups. To address this, we propose three concrete evaluation dimensions for pluralistic watermark benchmarking: cross-lingual detection parity, culturally diverse content coverage, and demographic disaggregation of detection metrics. We argue that watermarking is part of the pluralistic alignment pipeline and should be held to the same evaluation standards. We connect this to governance frameworks currently mandating watermarking deployment without requiring fairness evaluation.
Our position is that evaluation must precede deployment, and that the same bias auditing requirements applied to AI models should extend to the verification layer.
2026-04-15T12:06:56Z
7 pages. Accepted at the Multimodal Alignment for a Pluralistic Society (MAPS) Workshop, CVPR 2026
Alexander Nemecek, Osama Zafar, Yuqiao Xu, Wenbiao Li, Erman Ayday