https://arxiv.org/api/ZU5do17DdM/DO9mKrakVORwW7Ek
2026-06-14T17:49:29Z
3093445015

http://arxiv.org/abs/2605.24142v1
A Taxonomy of Metacognitive Learning Scenarios in Professional Contexts: Integrating Systems Theory with Empirical Constraints
Updated: 2026-05-22T19:03:58Z
Metacognitive theories provide foundational frameworks for understanding self-regulated learning, yet they lack systematic integration into comprehensive scenario taxonomies capable of guiding AI-enhanced professional development interventions. Existing models inadequately specify how metacognitive components combine into distinct learning scenarios or how professionals progress from novice to expert functioning. A six-node open systems model, consisting of Environment, Input, Processes, Structures, Output, and Feedback, was developed by synthesizing four major theoretical frameworks. Combinatorial enumeration generated 216 mathematically possible learning scenarios. Four sequential constraint-based filters, including psychological plausibility, educational relevance, measurement feasibility, and intervention potential, informed by empirical workplace learning research, reduced this space to 24 priority scenarios. Five focal scenarios were subjected to formal concept analysis. The 24 priority scenarios were distributed across three developmental tiers: novice, with 6 scenarios; developing, with 10 scenarios; and expert/adaptive, with 8 scenarios. Analysis revealed critical theoretical gaps regarding the dynamic reconfiguration of monitoring-control relationships across expertise levels, the role of feedback topology in metacognitive development, and trade-offs between internal integration and external connectivity. Multiple viable developmental trajectories were identified. The taxonomy enables targeted, scenario-specific professional development interventions and generates testable predictions for advancing metacognition theory beyond primarily descriptive accounts.
Published: 2026-05-22T19:03:58Z
Authors: David C.
Gibson, Mary Elizabeth Azukas, Meryem Yilmaz Soylu

http://arxiv.org/abs/2510.16435v2
What Questions Should Robots Be Able to Answer? A Dataset of User Questions for Explainable Robotics
Updated: 2026-05-22T18:07:30Z
With the growing use of large language models and conversational interfaces in human-robot interaction, robots' ability to answer user questions is more important than ever. We therefore introduce a dataset of 1,893 user questions for household robots, collected from 100 participants and organized into 12 categories and 70 subcategories. Most work in explainable robotics focuses on why-questions. In contrast, our dataset provides a wide variety of questions, from questions about simple execution details to questions about how the robot would act in hypothetical scenarios -- thus giving roboticists valuable insights into what questions their robot needs to be able to answer. To collect the dataset, we created 15 video stimuli and 7 text stimuli, depicting robots performing varied household tasks. We then asked participants on Prolific what questions they would want to ask the robot in each portrayed situation. In the final dataset, the most frequent categories are questions about task execution details (21.4%), the robot's capabilities (12.6%), and performance assessments (10.7%). Although questions about how robots would handle potentially difficult scenarios and ensure correct behavior are less frequent, users rank them as the most important for robots to be able to answer. Moreover, we find that users who identify as novices in robotics ask different questions than more experienced users. Novices are more likely to inquire about simple facts, such as what the robot did or the current state of the environment.
As robots enter environments shared with humans and language becomes central to giving instructions and interaction, this dataset provides a valuable foundation for (i) identifying the information robots need to log and expose to conversational interfaces, (ii) benchmarking question-answering modules, and (iii) designing explanation strategies that align with user expectations.
Published: 2025-10-18T10:16:45Z
Authors: Lennart Wachowiak, Andrew Coles, Gerard Canal, Oya Celiktutan

http://arxiv.org/abs/2605.23890v1
Divergent Paths to Depolarization: Dialogue Design Determines the Prosocial Benefits of AI-Assisted Political Argumentation
Updated: 2026-05-22T17:51:10Z
Argumentative dialogues across political divides can reduce polarization, yet opportunities for citizens to engage with opposing views in accessible and structured ways remain limited. AI dialogue partners offer a scalable framework for such open-mindedness exercises, but how the format of human-AI dialogues shapes their benefits remains unclear. In a two-session online experiment, 469 US participants were assigned to argue either for or against their own attitude on a contested political issue with an AI chatbot. Our experimental findings show attitude-congruent dialogues produced greater immediate reduction in both affective and opinion polarization than attitude-incongruent dialogues. By contrast, attitude-incongruent dialogues elicited weaker cognitive state empathy than the non-AI reference task but increased cognitive trait empathy in the two-week period between sessions, suggesting the effects of active generation of attitude-incongruent arguments may emerge over time.
These findings highlight dialogue design as a key determinant of effective AI-mediated behavioral interventions.
Published: 2026-05-22T17:51:10Z
Authors: Jianlong Zhu, Syed Muhammad Jhon Raza Naqvi, Carolin-Theresa Ziemer, Usman Naseem, Ingmar Weber

http://arxiv.org/abs/2605.23867v1
Human Decision-Making with Persuasive and Narrative LLM Explanations
Updated: 2026-05-22T17:25:02Z
Large language models (LLMs) have the potential to aid and improve human decision-making in classification tasks, not only by providing fairly accurate predictions, but also in their ability to generate cogent narrative explanations of those predictions. Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions; however, less is known about the impact of narrative explanations on objective human decision-making performance. Here we conduct a large-scale human behavioral experiment to evaluate decision-making performance with LLM-generated narrative explanations of varying persuasiveness. We found the degree of persuasiveness, or lack thereof, for LLM-based explanations did not meaningfully impact decision accuracy over a simple AI prediction alone, in agreement with typical results with explainable AI based on feature importance. We found evidence that narratives increased reliance on AI, but both when the AI prediction was correct and incorrect. Exploratory analyses also indicated that the more persuasive narratives may have had a detrimental effect on decision response times and the ability to discriminate between a correct and incorrect AI prediction. Overall, this work indicates that including narrative explanations with AI predictions may involve tradeoffs for decision-making performance, and more work is needed to determine how and when narrative explanations impact human decision-making.
Published: 2026-05-22T17:25:02Z
Authors: Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z.
Bakdash, Murat Kantarcioglu

http://arxiv.org/abs/2604.26145v2
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
Updated: 2026-05-22T17:07:34Z
AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners--and even teachers--to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. These failures, which we argue are conducive to "explainability pitfalls," are AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.
Published: 2026-04-28T22:05:57Z
Comment: Accepted to Misleading Impacts Resulting from AI Generated Explanations (MIRAGE) Workshop @ IUI 2026
Authors: Ben Knight, Wm.
Matthew Kennedy, Danielle Carvalho, Isaac Pattis, James Edgell

http://arxiv.org/abs/2605.13930v3
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Updated: 2026-05-22T16:58:48Z
EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque: a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) across three architecturally distinct EEG transformers: SleepFM, REVE, and LaBraM to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $α$-band restoration.
Published: 2026-05-13T16:02:56Z
Comment: Preprint.
14 pages, 7 figures, 4 tables
Authors: William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, Magnus Guldberg Pedersen, Anton Mosquera Storgaard, Nick Williams, Radu Gatej, Tue Lehn-Schiøler, Andreas Brink-Kjær, Sadasivan Puthusserypady, Sándor Beniczky, James Zou, Lars Kai Hansen

http://arxiv.org/abs/2605.23823v1
"I can't read your mind": A Study of Neurodivergent Computing Students' Experiences with Collaborative Active Learning
Updated: 2026-05-22T16:24:46Z
Computing courses often feature active learning techniques that promote collaboration and social interaction between students. However, neurodivergent students' preferences and experiences with these techniques are not well understood. We conducted a survey of neurodivergent computing students (n=24), specifically autistic students or students with ADHD, and neurotypical computing students (n=20) to understand how the structure of collaborative active learning affects their comfort in computing courses. We also interviewed four computing students on the autism spectrum or with ADHD to gain more contextualized insights into their experiences and accessibility recommendations. Our survey surfaces how team dynamics and assignment structure can impact neurodivergent students' comfort in computing courses. Neurodivergent students expressed discomfort with assignments that lack structure or have ambiguous expectations. Neurodivergent students prefer smaller teams that work together frequently with explicitly defined roles. Our interviews identified ways that neurodivergent students cope with discomfort in collaborative active learning, including self-selecting roles and self-disclosure.
While preliminary, our results highlight how instructors can design collaborative active learning to be more equitable and accessible for neurodivergent students.
Published: 2026-05-22T16:24:46Z
Authors: Cynthia Zastudil, Srishty Muthusekaran, Rayhona Nasimova, Stephen MacNeil

http://arxiv.org/abs/2602.01694v3
Beyond the Single Turn: Reframing Refusals as Dynamic Experiences Embedded in the Context of Mental Health Support Interactions with LLMs
Updated: 2026-05-22T16:18:49Z
Content Warning: This paper contains participant quotes and discussions related to mental health challenges, emotional distress, and suicidal ideation. Large language models (LLMs) are increasingly used for mental health support, yet the model safeguards -- particularly refusals to engage with sensitive content -- remain poorly understood from the perspectives of users and mental health professionals (MHPs) and have been reported to cause real-world harms. This paper presents findings from a sequential mixed-methods study examining how LLM refusals are experienced and interpreted in mental health support interactions. Through surveys (N=53) and in-depth interviews (N=16) with individuals using LLMs for mental health support and MHPs, we reveal that refusals are not isolated, single-turn system behaviors but rather constitute dynamic, multi-phase experiences: pre-refusal expectation formation, refusal triggering and encounter, refusal message framing, resource referral provision, and post-refusal outcomes. We contribute a multi-phase framework for evaluating refusals beyond binary policy compliance accuracy and design recommendations for future refusal mechanisms.
These findings suggest that understanding LLM refusals requires moving beyond single-turn interactions toward recognizing them as holistic experiences embedded within users' support-seeking trajectories and the broader LLM design pipeline.
Published: 2026-02-02T06:08:04Z
Authors: Ningjing Tang, Alice Qian, Qiaosi Wang, Esther Howe, Blake Bullwinkel, Paola Pedrelli, Jina Suh, Hoda Heidari, Hong Shen

http://arxiv.org/abs/2605.23804v1
Perceptually Lossless Tactile Texture Synthesis with Compact Spectral Envelope Models
Updated: 2026-05-22T16:05:33Z
Modern audio-visual media rely on compact representations for efficient storage and transmission, whereas realistic digital touch still depends on high-resolution tactile recordings. Existing approaches for representing tactile signals constrain manipulation and limit the generation of new content. Here, we introduce two compact representations, spectral beta and spectral slope, that capture the temporal spectral structure of finger-surface friction signals while preserving perceptually relevant information. Spectral beta models spectral skewness using a two-parameter beta distribution, whereas spectral slope approximates the spectrum with an asymmetric bandpass filter defined by low- and high-pass orders. We evaluated these representations in a perceptual study with 14 participants using five virtual textures rendered on a friction-modulation display and compared them with physical textures and high-fidelity reproductions of recorded signals. Spectral beta achieved perceptual similarity ratings comparable to those of the original high-fidelity reproductions. Regression analysis further showed that matching spectral energy across nine critical frequency bands was the strongest predictor of perceived realism. Together, these findings suggest that tactile texture perception depends primarily on fundamental temporal spectral patterns and that modeling these patterns is sufficient for perceptually realistic rendering.
These results establish an efficient and scalable framework for haptic compression, communication, and synthetic texture generation.
Published: 2026-05-22T16:05:33Z
Comment: 16 pages and 8 figures
Authors: Jagan K. Balasubramanian, Yasemin Vardar

http://arxiv.org/abs/2605.23787v1
Engagement-Optimized Care: When LLMs become Mental Health Infrastructure
Updated: 2026-05-22T15:50:26Z
General-purpose LLMs are increasingly functioning as mental health infrastructure due to gaps in care left by provider shortages, inadequate insurance coverage, social isolation, and stigma around formal help-seeking. This shift poses a distinct problem for AI ethics: systems neither designed nor governed as care technologies are being used as such, while their dominant design incentives optimize for engagement rather than user well-being. We present findings from a qualitative, longitudinal study with 18 US-based participants who use general-purpose LLMs for socioemotional support and participated in one or more of our study phases, including initial interviews, a four-week diary study, focus groups, and exit interviews. Participants turned to LLMs because other forms of support were unavailable, unaffordable, socially costly, or inadequate. As they continued to use these systems, design features such as anthropomorphic cues, default validation, persistent responsiveness, and weak disengagement mechanisms shaped their ongoing reliance. Participants described meaningful support alongside dependency, epistemic distortion through one-sided validation, privacy expectations without corresponding legal protection, and continued use despite awareness of these risks. We argue these dynamics reflect a structurally unfair tradeoff: users accept risks because support is otherwise absent, while available systems are optimized to deepen engagement and lack care-based accountability.
The paper makes three contributions: it traces the arc through which LLMs become care infrastructure and identifies distinct ethical tensions at each stage, shifts analysis from turn-based exchanges to longitudinal trajectories of use, and argues that accountability belongs at the design and incentive conditions through which these systems become care infrastructure rather than at the output or crisis-response layer.
Published: 2026-05-22T15:50:26Z
Comment: 10 pages, 1 figure
Authors: Briana Vecchione, Meryl Ye, Livia Garofalo, Ranjit Singh

http://arxiv.org/abs/2605.23676v1
AI at the Front Lines of Platform Governance: Using LLMs to Support Illegal Content Reporting under the Digital Services Act
Updated: 2026-05-22T14:22:43Z
Illegal content reporting mechanisms are a key technical and organizational measure through which online platforms address illegal content under the European Union Digital Services Act (DSA). Article 16 requires user notices to be sufficiently substantiated and submitted in good faith, placing users in the difficult position of interpreting legal and procedural language and translating ambiguous content into legally meaningful categories and reasons. We investigate how large language model (LLM)-based assistants can support this reporting process. In a controlled user study (N = 450) using an interface modeled on a major platform reporting workflow, we compare three conditions: unaided reporting, a conventional explainable AI assistant (XAI) that suggests a single legal category with a rationale, and an evaluative AI assistant (EvalAI) that presents balanced pro and con arguments across candidate legal provisions. We further examine these assistance forms under systematically varied AI error regimes. Our results show that EvalAI improves provision-level accuracy under AI error and reduces misclassification distance relative to conventional XAI, particularly for near-miss and overbreadth errors.
When AI output is correct, conventional XAI enables faster decisions, but neither AI assistance form reliably improves the quality of users' substantiated explanations relative to unaided reporting. We discuss design implications for compliance-oriented reporting interfaces, highlighting trade-offs between accuracy, deliberation, explanation quality, and vulnerability to misleading AI output.
Published: 2026-05-22T14:22:43Z
Authors: Marie-Therese Sekwenz, Shreyan Biswas, Rita Hermann-Gsenger, Ujwal Gadiraju
DOI: 10.1145/3805689.3812301

http://arxiv.org/abs/2605.23663v1
Detecting Drunk Driving Using Off-the-Shelf Smartwatches
Updated: 2026-05-22T14:13:28Z
Alcohol-impaired driving remains a major yet preventable cause of road traffic injury and death, with many drivers underestimating their level of intoxication. Compared to in-vehicle systems, mobile drunk-driving detection using consumer smartwatches offers a scalable way to trigger preventive interventions and increase awareness without additional in-vehicle hardware. We introduce a system that leverages wrist accelerometer data and heart rate variability-derived physiological signals to detect alcohol-related driving impairment. We collected data in a randomized, controlled three-arm test-track study (n=54) and trained both logistic regression models with window-aggregated features and a two-tower 1D convolutional neural network (CNN), to detect alcohol-impaired driving. The CNN achieved a participant-averaged area under the receiver operating characteristic (AUROC) of 0.88 for detecting any alcohol intoxication and 0.86 for detecting driving above the WHO-recommended limit of 0.05 g/dL. To the best of our knowledge, this is the first work to (1) demonstrate drunk-driving detection using consumer smartwatches, (2) develop and evaluate such a system in a real vehicle on a closed test track, and (3) rigorously assess generalization to unseen participants.
Together, these findings highlight the potential of wearable-based sensing to support scalable, measurement-driven prevention of alcohol-related traffic harm.
Published: 2026-05-22T14:13:28Z
Comment: 27 pages, 7 figures
Authors: Robin Deuber, Lanlan Yang, Michal Bechny, Christoph Heck, Matthias Pfäffli, Matthias Bantle, Florian von Wangenheim, Elgar Fleisch, Wolfgang Weinmann, Manuel Günther, Felix Wortmann, Varun Mishra

http://arxiv.org/abs/2605.23598v1
When Youth Enter the Algorithmic Wild: Discovering and Understanding Potentially Harmful Teen Videos on Douyin and Kwai
Updated: 2026-05-22T13:06:46Z
Short-video platforms like Douyin and Kwai have become central to adolescent digital life, but they also risk exposing teens to algorithmically amplified harmful content. Despite its societal importance, the scale, mechanisms, and real-world impact of this exposure remain poorly understood. Measuring it is challenging: recommendation feeds are personalized black boxes, harmful content employs sophisticated evasion tactics, and naive crawlers fail to replicate authentic teen behavior. To bridge this gap, we propose PHTV-Scout, the first large-scale, behaviorally grounded measurement framework for Potentially Harmful Teen Videos (PHTVs). We integrate an offline survey of 683 adolescents with a tri-module online pipeline: (1) PHTV Hunter simulates teen accounts to collect recommendation feeds; (2) PHTV Arbiter, a LoRA-finetuned multimodal classifier, detects PHTVs with 94.29% accuracy and 96.41% precision; and (3) PHTV Analyzer performs fine-grained categorization and impact assessment. Over six months, we analyzed 186,727 videos and 51,287 comments, uncovering a troubling 6.11% PHTV prevalence--dominated by Child Sexual Exploitation Imagery (53.2%)--and revealing that harmful content thrives through covert interactions (e.g., grooming comments, self-disclosure) and active evasion (semantic camouflage, noise injection). Crucially, while Youth Mode blocks 100% of PHTVs, its low adoption (30-41%) leaves most teens unprotected.
We further show that exposure is driven not by user identity but by regulation, platform algorithms, and even passive browsing, exposing the fragility of adolescent information environments. Our findings call for a paradigm shift from reactive takedowns to proactive, human-centered safeguards.
Published: 2026-05-22T13:06:46Z
Authors: Shaoxuan Zhou, Yafei Sun, Jing Zhang, Xianghang Mi

http://arxiv.org/abs/2605.23535v1
MindCopilot: Towards Formalizing and Evaluating Granular Human-LLM Co-Writing
Updated: 2026-05-22T11:55:56Z
Recent writing assistants are increasingly shifting from passive, prompt-driven interaction to proactive, suggestion-based completion, which integrates localized continuations into the writing flow and reduces coordination burden. However, existing evaluations simply focus on output quality, failing to capture how users accept, edit, or repair suggestions in real-time interaction, and thus obscuring the true usability of proactive co-writing systems. To address this gap, we adopt a sequential, behavior-centered view of interactive writing and formalize co-writing as a Human-in-the-Loop Markov Decision Process, modeling writing as an interaction shaped by user acceptance and editing decisions. Based on this formulation, we introduce the Co-Writing Fidelity Suite, an interaction-aware metric suite that captures both user-assistant alignment and cognitive editing effort, including Hierarchical Acceptance Rate and Knowledge-aware Editing Distance. We conduct a large-scale simulation study across 16 writing domains, using 1,688 controlled continuation queries sampled from different writing stages. Our analysis reveals systematic effects of interaction structure on acceptance behavior and editing cost. A follow-up user study with 30 participants confirms that these behavioral patterns align with real user experience.
Together, our findings demonstrate that interaction-aware evaluation provides insights beyond output-only metrics and informs the design of more effective proactive writing assistants.
Published: 2026-05-22T11:55:56Z
Comment: 30 pages, 8 figures. Accepted to IJCAI 2026
Authors: Youqing Fang (University of Science and Technology of China; Shanghai AI Laboratory), Yinhao Tang (University of Science and Technology of China; Shanghai AI Laboratory), Yanan Sun (Shanghai AI Laboratory), Jiangning Liu (Shanghai AI Laboratory), Ziyi Wang (Shanghai AI Laboratory), Xun Zhao (Shanghai AI Laboratory), Bin Liu (University of Science and Technology of China), Weiming Zhang (University of Science and Technology of China), Kuikun Liu (Shanghai AI Laboratory), Wenwei Zhang (Shanghai AI Laboratory), Kai Chen (Shanghai AI Laboratory)

http://arxiv.org/abs/2603.26906v2
KI-Adventskalender: An Informal Learning Intervention for Data & AI Literacy
Updated: 2026-05-22T11:01:43Z
Secondary school students increasingly encounter AI systems whose outputs depend on data quality, evaluation choices and modeling assumptions. To provide accessible entry points to these interconnected concepts, we developed KI-Adventskalender, a free web-based extracurricular initiative with 24 didactically curated, short, guided micro-challenges released daily in December, targeting data-centric competencies and socio-technical themes that shape how data are interpreted in practice. Drawing on two annual iterations, we report aggregate platform traces characterizing participation and task-level engagement. Participation increased substantially in 2025, but early attrition persists. Progression stabilized after midpoint: among users reaching Day 12 in 2025, more than 75% completed the calendar. Competence cluster performance shifted across years; higher revision rates co-occurred with strong pass rates, suggesting sustained engagement.
We use these observations to motivate a next-step measurement agenda: tighter task instrumentation, embedded micro-assessments and mixed-method evaluation designs that can distinguish persistence from conceptual uptake, knowledge progression and durable learning outcomes.
Published: 2026-03-27T18:27:06Z
Comment: Accepted at ACM CHI 2026 Workshop on Data Literacy for the 21st Century
Authors: Rahul Sharma, Lars Henrich, Larisa Ivanova, Arsalan Karimzadmotallebiazar, Annette Bieniusa, Leo Van Waveren, Sebastian Vollmer