https://arxiv.org/api/EThWCRVltQl4PoPgIQ3ya+fx/iI 2026-06-21T10:15:02Z 28997 615 15 http://arxiv.org/abs/2512.20298v4 Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives 2026-05-22T09:51:52Z

Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. This depth-over-breadth case study directly compares state-of-the-art LLMs and mental health professionals in assessing Borderline (BPD) and Narcissistic (NPD) Personality Disorders based on Polish-language first-person autobiographical accounts. Within our sample, the overall diagnostic scores of the top-performing Gemini Pro models (65.48%) were 21.91 percentage points higher than the average scores of the human professionals (43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 and F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a potential reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patients' sense of self and temporal experience. Our findings demonstrate that while LLMs might be competent at interpreting complex first-person clinical data, their outputs still carry critical reliability and bias issues.

2025-12-23T12:05:01Z Karolina Drożdż Kacper Dudzic Anna Sterna Marcin Moskalewicz http://arxiv.org/abs/2606.13693v1 Limited Marginal Benefit of Reasoning-Heavy LLM Deployment in ESG Narrative Scoring: A 4-Model Consensus Study on Japanese Listed Firms 2026-05-22T03:35:33Z

Automated scoring of ESG narrative disclosures with large language models (LLMs) is gaining traction, yet whether reasoning-heavy frontier models add value commensurate with their cost remains empirically unsettled. We evaluate this question on a corpus of ten Japanese listed firms across three rubric axes -- quantitative targets, progress-tracking infrastructure, and external-standard alignment -- using a four-model consensus design that combines a reasoning-on frontier model with three reasoning-off contemporaries. Across 120 firm x axis x model scores, the pooled mean absolute deviation between the reasoning-on model and each reasoning-off counterpart is 0.38 on a 5-point scale; only 2% of pairwise comparisons reach a two-point deviation, and none exceeds two points. Per-firm cost accounting shows the reasoning-on arm alone costs roughly 5.6x as much as the three-provider reasoning-off ensemble, for outcomes that differ only within small margins. We conclude that in span-based ESG narrative scoring, reasoning-heavy deployment does not materially improve outcomes relative to reasoning-off consensus, while substantially increasing operational cost. We discuss implications for cost-effective ESG auto-scoring pipelines and LLM deployment governance in applied accountability settings. An earlier version of this work is available on SSRN (Abstract ID 6683303).
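The pooled mean absolute deviation described above can be illustrated with a minimal sketch. The model names, firm names, and scores below are hypothetical placeholders, not data from the study; the sketch only shows how a reference model's scores might be compared cell by cell against each counterpart on a 5-point rubric.

```python
# Hypothetical 5-point rubric scores: scores[model][(firm, axis)].
# All names and values are illustrative assumptions, not study data.
CELLS = [
    ("firm_a", "targets"), ("firm_a", "tracking"), ("firm_a", "alignment"),
    ("firm_b", "targets"), ("firm_b", "tracking"), ("firm_b", "alignment"),
]

def as_scores(values):
    """Pair a flat list of scores with the (firm, axis) cells above."""
    return dict(zip(CELLS, values))

scores = {
    "reasoning_on": as_scores([4, 3, 5, 2, 3, 4]),
    "off_1": as_scores([4, 3, 4, 2, 3, 4]),
    "off_2": as_scores([3, 3, 5, 2, 4, 4]),
    "off_3": as_scores([4, 3, 5, 4, 3, 4]),
}

def pairwise_deviations(scores, reference, others):
    """Absolute deviations |reference - other| over every cell and counterpart."""
    return [
        abs(scores[reference][cell] - scores[other][cell])
        for other in others
        for cell in scores[reference]
    ]

def share_at_least(deviations, threshold):
    """Fraction of pairwise comparisons with a deviation >= threshold."""
    return sum(d >= threshold for d in deviations) / len(deviations)

deviations = pairwise_deviations(scores, "reasoning_on", ["off_1", "off_2", "off_3"])
pooled_mad = sum(deviations) / len(deviations)
two_point_share = share_at_least(deviations, 2)
print(f"pooled MAD = {pooled_mad:.2f}, share >= 2 points = {two_point_share:.1%}")
```

Pooling over all comparisons, rather than averaging per-model means, weights every firm, axis, and counterpart cell equally; whether the study pools in exactly this way is an assumption of the sketch.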

2026-05-22T03:35:33Z 12 pages. Earlier version available on SSRN, Abstract ID 6683303 Hiroyuki Kokubu http://arxiv.org/abs/2605.23193v1 CultivAgents: Cultivating Relationship-Centered Multi-Agent Systems for Personalized Gardening 2026-05-22T03:20:04Z

Gardening is critical to support well-being, cultural continuity, and food autonomy, yet existing digital tools often provide generic advice that overlooks gardeners' skills, local ecologies, seasons, and cultural contexts. We introduce CultivAgents, a relationship-centered multi-agent system for personalized, socio-culturally grounded gardening support. Grounded in ethics of care, CultivAgents coordinates multiple specialized agents: an Experience Agent that adapts guidance to users' skill levels, an Environmental Agent that grounds advice in local and seasonal conditions, and an Ethnobotanical Agent that connects plants to cultural knowledge and histories. We evaluated CultivAgents through a three-phase mixed-methods study with domain experts (n=3), HCI researchers (n=7), and community gardeners (n=5), analyzing expert feedback, pre/post surveys, and participatory design activities. Results suggest that CultivAgents helped gardeners translate interest into situated action: community gardeners reported increased confidence (3.00 to 3.60), motivation (4.00 to 4.40), and trust in acting on AI advice (3.20 to 4.00). Participants valued hyperlocal ecological guidance and complementary agent perspectives, while also identifying limits in cultural specificity, ecological grounding, and agent coordination. The work advances relationship-centered AI, offering design implications for multi-agent systems that support food sovereignty, community resilience, and cultural preservation.

2026-05-22T03:20:04Z Preprint, 9 pages. Website: https://hello-diana.github.io/CultivAgents/ Yiyang Wang Moeiini Reilly Britney Johnson Kefei Yan Alex Cabral Josiah Hester http://arxiv.org/abs/2605.23177v1 Cognitive offloading and the speedup illusion in human-AI interaction 2026-05-22T02:53:12Z

Large language models (LLMs) have the potential to boost human productivity by speeding up task completion -- provided users know when to offload cognitive work to them. But we do not know if users are well-calibrated in estimating these potential time savings. We conducted a preregistered large-scale behavioral study (N = 1237) to characterize mismatches between expectations and reality, with a focus on simple cognitive tasks. While actual completion times between independent completion and AI-assisted completion did not differ, participants predicted AI to be significantly faster. The same bias was not observed when imagining help from another human participant. We identify a speedup illusion where people have accurate forecasts of independent completion times but significantly underestimate AI-assisted times. Additionally, time and effort dissociate: participants reported lower subjective effort with AI despite equivalent completion times. This suggests that completion time itself is not sufficient to characterize efficiency gains.

2026-05-22T02:53:12Z Proceedings of the 48th Annual Meeting of the Cognitive Science Society Sunny Yu Myra Cheng Ahmad Jabbar Ilia Sucholutsky Katherine M. Collins Dan Jurafsky Robert D. Hawkins http://arxiv.org/abs/2605.23162v1 SolarChain: Bridging Physical Law, Verifiable Trust, and Sustainable Markets for Urban Energy Resilience 2026-05-22T02:30:54Z

Urban decarbonization requires scaling rooftop solar across millions of fragmented producers, yet cities face a fundamental tension: energy data is easily manipulated, and economic incentives often reward speculation rather than actual infrastructure deployment. We present SolarChain, a platform that resolves both problems by anchoring digital accountability to the thermodynamic limits of solar energy conversion. Using real-time meteorological data, geospatial coordinates, and first-principles calculations of solar yield, the system establishes a hard physical boundary for every panel's maximum possible output; any reported generation exceeding this limit is automatically rejected before entering the shared ledger. This trustless verification enables a peer-to-peer marketplace with programmatic reward structures that continuously reinvest value into equipment maintenance and market liquidity, preventing the speculative hoarding that typically destabilizes blockchain-based marketplaces. When electricity is consumed, the corresponding digital credits are permanently retired in direct proportion to physical energy dissipation, creating an auditable one-to-one mapping between urban consumption and carbon accounting. Deployed across heterogeneous city nodes, the prototype demonstrates resilience against data injection attacks while lowering capital barriers for community-level solar expansion. Beyond energy, the framework offers a general model for coordinating economic activity with physical law in any domain where distributed infrastructure demands both data integrity and sustainable investment. We release the data and code as open-access on GitHub.
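The physical-bound check described above can be sketched roughly as follows. This is not the SolarChain implementation: the `Panel` class, the efficiency cap, and the single-irradiance-reading model are simplifying assumptions made for illustration, whereas the abstract describes first-principles yield calculations driven by real-time meteorological and geospatial data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Panel:
    """Hypothetical panel description used only for this sketch."""
    area_m2: float
    efficiency_cap: float  # assumed upper bound on conversion efficiency (0..1)

def max_energy_kwh(panel: Panel, irradiance_w_m2: float, hours: float) -> float:
    """Upper bound on energy a panel could produce under constant irradiance."""
    watts = irradiance_w_m2 * panel.area_m2 * panel.efficiency_cap
    return watts * hours / 1000.0

def accept_report(panel: Panel, irradiance_w_m2: float, hours: float,
                  reported_kwh: float) -> bool:
    """Reject negative readings and any reading above the physical bound."""
    if reported_kwh < 0:
        return False
    return reported_kwh <= max_energy_kwh(panel, irradiance_w_m2, hours)

panel = Panel(area_m2=10.0, efficiency_cap=0.25)
bound = max_energy_kwh(panel, irradiance_w_m2=800.0, hours=1.0)
print(bound)                                    # bound in kWh for this interval
print(accept_report(panel, 800.0, 1.0, 1.6))    # plausible report
print(accept_report(panel, 800.0, 1.0, 2.5))    # exceeds the bound
```

A check of this kind would run before a report is written to the ledger, so an implausible reading never becomes a tradable credit.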

2026-05-22T02:30:54Z Shilin Ou Yifan Xu Zhenshan Zhang Luyao Zhang Ming-Chun Huang http://arxiv.org/abs/2605.23123v1 Defining AI Fatigue in Academic Contexts: Dimensions, Indicators, and a Stage-Based Model Using Grounded Theory 2026-05-22T00:46:39Z

The integration of AI tools in academic settings has introduced a distinct form of strain that existing frameworks like technostress and digital fatigue have not yet fully addressed. This study develops a conceptual model and identifies the dimensions that define AI fatigue as a form of strain arising from sustained academic use of AI tools. Using grounded theory analysis of open-ended responses from 1,054 university students across three universities in the Philippines, the study examined the cognitive, motivational, emotional, physical, and attentional pressures students experienced during AI-supported academic work. Analysis produced five dimensions of AI fatigue, namely Cognitive Overload, Motivational Disengagement, Moral Unease, Physical Strain, and Attentional Drift, each consisting of two indicators grounded in participant accounts. The findings also yielded the AI Fatigue Model, a stage-based framework that explains how these pressures accumulate and reinforce one another across repeated AI interaction in academic tasks. These contributions establish a conceptual and exploratory foundation for AI fatigue as a distinct construct and provide a basis for future instrument validation, scale development, and cross-contextual inquiry in academic settings where AI now mediates student learning.

2026-05-22T00:46:39Z 17 pages, journal article, Volume 25, Issue 5, International Journal of Learning, Teaching and Educational Research, 25(5), 91-107 (2026) John Paul P. Miranda Emmanuel B. Parreño Jovita G. Rivera 10.26803/ijlter.25.5.5 http://arxiv.org/abs/2602.12316v2 GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory 2026-05-22T00:32:35Z

Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 1,535 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents fail to choose socially beneficial actions in 38% of high-stakes cases, such as military escalation, election manipulation, and medical malpractice. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.
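For readers unfamiliar with the game structures named above, the following minimal sketch encodes textbook payoff matrices and finds pure-strategy Nash equilibria alongside the welfare-maximizing outcomes. The payoff values are standard illustrative numbers, not scenarios from GT-HarmBench, and treating "socially beneficial" as "maximizes the sum of payoffs" is an assumption of this sketch rather than the benchmark's definition.

```python
from itertools import product

# payoffs[(row_action, col_action)] = (row_payoff, col_payoff)
# Textbook illustrative values; not taken from the benchmark.
PRISONERS_DILEMMA = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
STAG_HUNT = {
    ("stag", "stag"): (4, 4), ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0), ("hare", "hare"): (3, 3),
}
CHICKEN = {
    ("swerve", "swerve"): (0, 0), ("swerve", "straight"): (-1, 1),
    ("straight", "swerve"): (1, -1), ("straight", "straight"): (-10, -10),
}

def actions(game):
    """Shared action set of a symmetric two-player game."""
    return sorted({row for row, _ in game})

def pure_nash(game):
    """Profiles where neither player gains by deviating unilaterally."""
    acts = actions(game)
    equilibria = []
    for r, c in product(acts, acts):
        row_ok = all(game[(r, c)][0] >= game[(alt, c)][0] for alt in acts)
        col_ok = all(game[(r, c)][1] >= game[(r, alt)][1] for alt in acts)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria

def welfare_optima(game):
    """Profiles that maximize the sum of both players' payoffs."""
    best = max(sum(payoff) for payoff in game.values())
    return sorted(p for p, payoff in game.items() if sum(payoff) == best)

for name, game in [("PD", PRISONERS_DILEMMA), ("Stag Hunt", STAG_HUNT),
                   ("Chicken", CHICKEN)]:
    print(name, "Nash:", pure_nash(game), "welfare-optimal:", welfare_optima(game))
```

In the Prisoner's Dilemma the only equilibrium differs from the welfare-optimal profile, which is the kind of gap between individually rational and socially beneficial play that multi-agent evaluations probe.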

2026-02-12T17:29:52Z Pepijn Cobben Xuanqiang Angelo Huang Thao Amelia Pham Isabel Dahlgren Terry Jingchen Zhang Zhijing Jin http://arxiv.org/abs/2605.23103v1 A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works 2026-05-21T23:40:51Z

I present Lepton (Letter Prediction), a fine-tuned BERT classifier that predicts whether a title in a Classical Chinese wenji table of contents is a personal letter or a closely confusable preface (particularly the farewell-preface). Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati. I have deployed the model on Hugging Face, and it has been used at the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji, populating the Ming Letter Platform.

2026-05-21T23:40:51Z Queenie Luo http://arxiv.org/abs/2605.23093v1 A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses 2026-05-21T23:00:40Z

Topic modeling in applied psychology increasingly spans two methodological traditions: probabilistic bag-of-words models and newer embedding-based approaches. Yet many evaluations of these methods rely on longer and cleaner benchmark corpora, leaving less guidance for short, open-ended survey responses. This paper compares Structural Topic Models (STM), a probabilistic topic model, and BERTopic, an embedding-based model, for analyzing open-ended survey responses. We evaluated three STM conditions and five BERTopic conditions, varying typographical correction, stemming, embedding choice, and contextual augmentation, a strategy we introduced to provide additional semantic context for very short responses. Results indicate that BERTopic consistently produced higher topic coherence than STM, with contextual augmentation yielding the strongest performance gains. In contrast, higher-dimensional embeddings alone did not improve coherence and were associated with greater data loss. Qualitative evaluation showed that BERTopic generated more interpretable and stable topics, while STM topics were often broader and more mixed. However, STM provides stronger support for inferential covariate analysis, whereas BERTopic covariate comparisons are primarily descriptive. These findings suggest that STM and BERTopic offer complementary strengths. We conclude with practical guidance for selecting and combining topic modeling approaches in applied social science research.

2026-05-21T23:00:40Z Yan Jiang Sihong Liu Philip A. Fisher http://arxiv.org/abs/2601.09600v3 Information Access of the Oppressed: Freirean Design for Emancipatory Information Access 2026-05-21T22:51:29Z

Online information access (IA) platforms are targets of authoritarian capture. We explore the question of how to safeguard our platforms and ensure emancipatory outcomes through the lens of Paulo Freire's theories of emancipatory pedagogy. Freire's theories provide a radically different lens for exploring IA's sociotechnical concerns relative to the current dominating frames of fairness, accountability, and transparency. We make explicit, with the intention to challenge, the technologist-user dichotomy in IA platform development that mirrors the teacher-student relation in Freire's analysis. By extending Freire's analysis to IA, we critique the technologists-as-liberator frame where it is the burden of (altruistic) technologists to mitigate the risks of emerging technologies for marginalized communities. Instead, we advocate for Freirean Design whose goal is to structurally expose the platform for co-option and co-construction by community members in aid of their emancipatory struggles.

2026-01-14T16:15:26Z Bhaskar Mitra Nicola Neophytou Sireesh Gururaja http://arxiv.org/abs/2605.23048v1 StanBKT: Rethinking Parameter Estimation in Bayesian Knowledge Tracing 2026-05-21T21:27:10Z

Bayesian Knowledge Tracing (BKT) is a widely used and interpretable student modeling approach in intelligent tutoring systems and educational data mining. However, most implementations rely on expectation-maximization or related optimization methods that yield only point estimates, limiting uncertainty quantification and principled comparisons across learners and conditions. We introduce StanBKT, an open-source Python package for estimating BKT models using Bayesian inference in Stan. StanBKT provides a unified framework supporting Hamiltonian Monte Carlo, variational inference, Pathfinder, and optimization-based estimation while preserving the hidden Markov structure and interpretability of classical BKT. It supports standard, grouped, and hierarchical BKT models, flexible prior specification, posterior predictive inference, and utilities for visualization and diagnostics. We evaluate StanBKT on large-scale observational and controlled educational datasets. On the ASSISTments 2020 dataset, we show that supported inference methods achieve comparable predictive performance while differing in computational efficiency and posterior fidelity. We further demonstrate how posterior inference enables principled comparison of condition-specific parameters in an educational intervention involving perceptual cue manipulations. Results illustrate how uncertainty quantification facilitates more reliable interpretation of differences in learning, forgetting, guessing, and slipping parameters across experimental conditions. Overall, StanBKT extends BKT beyond point estimation by providing a flexible framework for probabilistic student modeling, uncertainty quantification, and hierarchical inference in educational data mining.
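The hidden Markov structure that classical BKT relies on can be sketched with the standard forward update over the learning, forgetting, guessing, and slipping parameters. This is a plain-Python illustration with fixed, made-up parameter values; it does not use or reproduce the StanBKT API, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BKTParams:
    """Illustrative point values; a Bayesian fit would yield posteriors instead."""
    p_init: float          # P(skill known before the first opportunity)
    p_learn: float         # P(unknown -> known) after an opportunity
    p_guess: float         # P(correct | unknown)
    p_slip: float          # P(incorrect | known)
    p_forget: float = 0.0  # P(known -> unknown); 0 in the classical model

def predict_correct(p_known: float, params: BKTParams) -> float:
    """Probability of a correct response given the current mastery estimate."""
    return p_known * (1 - params.p_slip) + (1 - p_known) * params.p_guess

def update(p_known: float, correct: bool, params: BKTParams) -> float:
    """Condition on the observed response, then apply the learn/forget transition."""
    if correct:
        posterior = p_known * (1 - params.p_slip) / predict_correct(p_known, params)
    else:
        p_wrong = 1 - predict_correct(p_known, params)
        posterior = p_known * params.p_slip / p_wrong
    return posterior * (1 - params.p_forget) + (1 - posterior) * params.p_learn

def trace(responses, params: BKTParams):
    """Return the mastery estimate after each response in the sequence."""
    p_known, history = params.p_init, []
    for correct in responses:
        p_known = update(p_known, correct, params)
        history.append(p_known)
    return history

params = BKTParams(p_init=0.5, p_learn=0.2, p_guess=0.2, p_slip=0.1)
print(predict_correct(params.p_init, params))
print(trace([True, True, False, True], params))
```

With point values like these, the update gives a single mastery trajectory per learner; replacing each parameter with posterior draws is what makes it possible to attach uncertainty to that trajectory.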

2026-05-21T21:27:10Z 5 figures, 7 tables Siddhartha Pradhan Yanping Pei Morgan Lee Puyuan Zhang Erin Ottmar Adam C. Sales http://arxiv.org/abs/2605.23026v1 Opportunities and Risks of Generative AI through the Health Information Journey 2026-05-21T20:49:21Z

Artificial intelligence is fundamentally changing how health content is encountered and acted upon across both the information and healthcare ecosystems. AI systems now generate claims, curate information, interpret symptoms, synthesize evidence, and guide decisions, with significant opportunities and risks for the public. Potential benefits include improvements in access, comprehension, and continuity of care. At the same time, AI can introduce inaccurate or manipulative content that is difficult to distinguish from reliable guidance, and encourage automated decisions that affect care with little transparency or recourse. We introduce a four-stage framework to examine how these opportunities and risks unfold as the public moves through the information environment and into formal healthcare.

2026-05-21T20:49:21Z Matthew R. DeVerna Harry Yaojun Yan Kai-Cheng Yang Filippo Menczer http://arxiv.org/abs/2605.22995v1 Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good 2026-05-21T19:49:27Z

Agentic AI systems are increasingly proposed for social-good domains, often invoking the United Nations Sustainable Development Goals (SDGs) as a vocabulary of global benefit. Yet claims of social good do not establish accountability to the communities a system claims to serve. We present a structured survey of 112 papers on agentic AI for social good published between 2015 and 2026. We find a moral-geographic asymmetry: papers are least likely to specify geographic context in precisely the domains where local political, legal, and cultural context matters most. Across the corpus, 82 of 112 papers (73%) specify no geographic context. Papers aligned with health or physical/ecological SDGs specify geography 37-40% of the time, while papers aligned with institutional and social-policy SDGs do so only 13% of the time. SDG 16, peace, justice, and strong institutions, is both the most-covered goal in the corpus and the one with the lowest geographic-specification rate. We interpret this as moral abstraction: agentic AI for social good often treats institutional good as universal in ways it does not treat health or ecological good. A second finding compounds this: only 28 of 112 papers (25%) report any real-world deployment or small-scale test. We identify five accountability gaps and propose a minimal reporting standard for more context-specific, participatory, and accountable agentic AI for social good.

2026-05-21T19:49:27Z Poli Nemkova Haeshitha Indukuri Jaedon Charles http://arxiv.org/abs/2602.13241v2 Empowering 9-1-1 Calltaking Training with Generative AI: Experiences and Lessons Learned 2026-05-21T19:00:01Z

Emergency call-takers form the first operational link in public safety response, handling over 240 million calls annually while facing a sustained training crisis: staffing shortages exceed 25% in many centers, and preparing a single new hire can require up to 720 hours of one-on-one instruction that removes experienced personnel from active duty. Traditional training approaches struggle to scale under these constraints, limiting both coverage and feedback timeliness. In partnership with Metro Nashville Department of Emergency Communications (MNDEC), we designed, developed, and deployed a GenAI-powered call-taking training system under real-world constraints. Over six months, deployment scaled from initial pilot to 190 operational users across 1,120 training sessions, exposing systematic challenges around system delivery, rigor, resilience, and human factors that remain largely invisible in controlled or purely simulated evaluations. By analyzing deployment logs capturing 98,429 user interactions, organizational processes, and stakeholder engagement patterns, we distill four key lessons, each coupled with concrete design and governance practices. These lessons provide grounded guidance for researchers and practitioners seeking to deliver AI-driven training systems in safety-critical public sector environments where practical constraints fundamentally shape human-centric design.

2026-01-30T18:05:59Z Accepted at IEEE SmartComp 2026 Zirong Chen Yilin Liu Meiyi Ma http://arxiv.org/abs/2603.04383v2 Turning Trust to Transactions: Tracking Affiliate Marketing and FTC Compliance in YouTube's Influencer Economy 2026-05-21T17:15:33Z

YouTube has evolved into a powerful platform where creators monetize their influence through affiliate marketing, raising concerns about transparency and ethics, especially when creators fail to disclose their affiliate relationships. Although regulatory agencies like the US Federal Trade Commission (FTC) have issued guidelines to address these issues, non-compliance and consumer harm persist, and the extent of these problems remains unclear. In this paper, we introduce tools, developed with insights from recent advances in Web measurement and NLP research, to examine the state of the affiliate marketing ecosystem on YouTube. We apply these tools to a 10-year dataset of 2 million videos from nearly 540,000 creators, analyzing the prevalence of affiliate marketing on YouTube and the rates of non-compliant behavior. Our findings reveal that affiliate links are widespread, yet disclosure compliance remains low, with most videos failing to meet FTC standards. Furthermore, we analyze the role of different stakeholders in improving disclosure behavior. Our study suggests that platform involvement, through standardized disclosure features, is strongly associated with improved compliance. We recommend that regulators and affiliate partners collaborate with platforms to enhance transparency, accountability, and trust in the influencer economy.

2026-03-04T18:47:12Z ICWSM 2026 Chen Sun Yash Vekaria Zubair Shafiq Rishab Nithyanand