https://arxiv.org/api/3K4gWL0HZwTUHFObAwvex+ztRws 2026-06-14T22:58:38Z 30934 525 15 http://arxiv.org/abs/2601.22812v2 Stable Personas: Dual-Assessment of Temporal Stability in LLM-Based Human Simulation 2026-05-20T17:52:10Z Large Language Models (LLMs) acting as artificial agents offer the potential for scalable behavioral research, yet their validity depends on whether LLMs can maintain stable personas across extended conversations. We address this point using a dual-assessment framework measuring both self-reported characteristics and observer-rated persona expression. Across two experiments testing four persona conditions (default, high, moderate, and low ADHD presentations), seven LLMs, and three semantically equivalent persona prompts, we examine between-conversation stability (3,473 conversations) and within-conversation stability (1,370 conversations and 18 turns). Self-reports remain highly stable both between and within conversations. However, observer ratings reveal a tendency for persona expressions to decline during extended conversations. These findings suggest that persona-instructed LLMs produce stable, persona-aligned self-reports, an important prerequisite for behavioral research, while identifying this regression tendency as a boundary condition for multi-agent social simulation. 2026-01-30T10:38:52Z Jana Gonnermann-Müller Jennifer Haase Nicolas Leins Thomas Kosch Sebastian Pokutta http://arxiv.org/abs/2605.21460v1 HITL-D: Human In The Loop Diffusion Assisted Shared Control 2026-05-20T17:49:56Z Autonomous manipulation systems have achieved remarkable capabilities, yet the integration of human expertise with diffusion-based policies in shared control remains relatively unexplored. In this paper, we propose Human-In-The-Loop Diffusion (HITL-D), a shared control framework that enhances user performance in multi-step, insertion, and fine manipulation tasks. 
HITL-D leverages a novel combination of diffusion-based policies and human control to provide autonomous end effector orientation updates conditioned on a scene point cloud and the Cartesian position of the end effector. This approach reduces the number of joystick control axes required, thereby lowering mental workload. In a multi-task user study with 12 participants, HITL-D reduced average task completion times by 40%, decreased perceived workload by 37%, and improved Likert-scale ratings for independence, intuitiveness, and confidence compared to traditional teleoperation methods. These results demonstrate that HITL-D effectively integrates human expertise with autonomous assistance, improving both objective and subjective aspects of teleoperation. 2026-05-20T17:49:56Z Accepted for presentation at ICRA 2026 Riley Zilka Sergey Khlynovskiy Allie Wang Martin Jagersand http://arxiv.org/abs/2605.19362v2 Toward User Comprehension Supports for LLM Agent Skill Specifications 2026-05-20T17:49:17Z Users often interpret and select agent skills through their SKILL markdown specifications. To protect users, existing audits mainly focus on malicious or unsafe skills. We study the complementary question of whether specifications help users form bounded expectations about what a skill consumes, produces, and covers. Across 878 cybersecurity skills, we used rule-based coding to measure textual cues for four comprehension anchors, namely operational basis, output contract, boundary disclosure, and example capability demonstration. Cues for operational basis were common, but only 19.0% of specifications exhibited cues for an example task, sample, or expected outcome, and only 2.3% exhibited cues for all four anchors. We further examined a small DNS/C2 telemetry subset (n$=$6) to illustrate why missing examples may matter. 
Examples appeared to make first local checks easier to construct, while no-example skills typically required helper code inspection to recover command arguments or output fields. We argue that agent-skill evaluation should treat specifications as user-facing capability disclosures, not merely as containers for executable instructions. 2026-05-19T04:50:42Z To appear at ACM CAIS Workshop Agent Skill 2026 Zikai Alex Wen http://arxiv.org/abs/2605.21569v1 When Support Escalates Distress: Regulation and Escalation in LLM Responses to Venting and Advice-Seeking 2026-05-20T17:42:36Z Large language models are increasingly used for mental health support, yet little is known about whether their responses are psychologically safe across different help-seeking styles. We examine a foundational distinction in emotional disclosure, venting vs. advice-seeking, and whether LLMs respond in ways that regulate or amplify distress. Using 178,800 Reddit posts, we first show the two help-seeking styles are linguistically distinguishable at scale. We then introduce a measurement framework grounded in interpersonal emotion regulation theory that captures Regulation and Escalation as empirically independent dimensions. Across persona conditions (default, friend, therapist), GPT-5.3 responses systematically mirror help-seeking style: venting elicits more regulation, but also more escalation. Therapist personas reduce escalation while maintaining regulation, whereas friend personas increase both. A crowdsourced human study finds no user experience penalty for the safer therapist condition, but reveals that lay raters cannot reliably detect escalation without expert knowledge. Responses that feel supportive may simultaneously intensify distress in ways standard safety evaluation cannot see, and empathy metrics alone cannot replace a framework that measures both. 
2026-05-20T17:42:36Z Vivienne Bihe Chi Adithya V Ganesan Ryan L Boyd Lyle Ungar Sharath Chandra Guntuku http://arxiv.org/abs/2605.21390v1 Designing Conversations with the Dead: How People Engage with Generative Ghosts 2026-05-20T16:45:32Z We examine how people experience two choices in the design of generative ghosts, AI systems that are trained on data of the dead: representation, where an AI speaks about a deceased person in the third person, and reincarnation, where the AI speaks as the deceased in the first person. Through a qualitative user study with 16 participants, we explore how each shaped authenticity, affect, and risk. Reincarnation was preferred for its immediacy, but participants shared fears of over-reliance. Representation was preferred for engaging with memory over conversational presence, though participants often ignored this distinction, engaging in dialogue despite third-person framing. Across both modes, participants privileged affective resonance over factual fidelity. We conclude by showing how factors such as tone, language, and conversational rhythm -- factors unique to the user's memory of the deceased -- shape interactions with generative ghosts, and argue that those interactions are always collaborative. 2026-05-20T16:45:32Z Jack Manning Daniel Sullivan Dylan Thomas Doyle Anthony T. Pinter Jed R. Brubaker http://arxiv.org/abs/2605.21374v1 Combating Harms of Generative AI in CS1 with Code Review Interviews and a Flipped Classroom 2026-05-20T16:37:07Z Background and Context: Large Language Models (LLMs) are more accessible and accurate than ever before, raising significant concerns for computing educators. One major concern is students using LLMs to bypass the effort needed to understand concepts and metacognitive strategies essential for success in computer science. Objectives: We contribute a unique approach to assessing and building up student understanding through weekly oral code review assessments. 
These formative assessments incentivize students to understand their submitted code, regardless of whether or not the code was generated by AI tools. We also use a flipped classroom to provide time for students to learn concepts outside of class and provide ample time for students to schedule code review interviews. Methods: For this paper, we collected data from three semesters. We analyze student exam scores, keystroke logs, and surveys to understand how the new course policies affected student learning, behavior, and attitudes. Findings: Pairwise comparison of exam results reveals a statistically insignificant increase in average scores for Fall 2025 compared to previous semesters. Keystroke logs show a significant increase in characters pasted per total characters input into coding assignments in Fall 2025, pointing towards higher AI usage. Survey results show positive student sentiment towards code reviews at the end of Fall 2025, with nearly all negative feedback being addressable through better scheduling and more rigorous TA training. Implications: Oral code reviews with a flipped classroom appear to be effective at mitigating harms of LLM use while providing space for students to freely experiment with these tools. Our work suggests that students in Fall 2025 still show adequate understanding of material covered in written exams, despite dramatic increases in LLM usage for coding assignments. 
2026-05-20T16:37:07Z Peter Fowles Erik Falor Sulove Bhattarai John Edwards Seth Poulsen http://arxiv.org/abs/2605.21361v1 Gen-AI-tecture: using generative AI to support architectural students in design tasks 2026-05-20T16:24:59Z The "Gen-AI-tecture" project embeds a locally executed, discipline-specific tool into a mixed-methods focus-group design, structured around three research objectives: (a) to evaluate how generative AI tools impact students' creativity in design-thinking processes and outcomes, (b) to assess whether these tools enhance inclusivity in learning processes, and (c) to examine how they develop students' AI-handling skills with a view to boosting future employability. Findings indicate enhanced creative fluency, broadened participation across diverse learner profiles, and strengthened confidence in AI-supported design processes. The study contributes evidence-based guidance for integrating generative-AI workflows into architectural pedagogy, demonstrating how such tools can operationalise constructivist principles of learner-led meaning-making, support connectivist understandings of learning as participation in human-AI networks, and advance universal learning theories by promoting more inclusive, flexible and accessible educational practices for contemporary learners. 2026-05-20T16:24:59Z Pre-print. Submitted to the Journal of Architectural Education Timo Kapsalis http://arxiv.org/abs/2605.21295v1 TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health 2026-05-20T15:25:46Z Longitudinal passive sensing enables continuous health prediction, yet models often fail under cross-dataset distribution shifts. Traditional ML overfits cohort-specific artifacts, while Large Language Models (LLMs) struggle to reason reliably over long, heterogeneous time-series. We introduce TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck. 
The model first abstracts raw signals into high-level natural language, then predicts behavioral outcomes from these abstractions alone. This forces the model to reason over semantic concepts that we argue generalize better than raw numbers. We optimize this process end-to-end using Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR), learning outcome-aligned abstractions without gold intermediate annotations. Instantiated on mental-health prediction, TimeSRL achieves state-of-the-art performance on a benchmark designed to stress-test cross-cohort generalization under a rigorous leave-one-dataset-out (LOSO) protocol, reducing mean absolute error (MAE) over strong non-LLM ML and LLM baselines by 3.1--10.1% and 9.5--44.1% for anxiety, and 3.2--9.6% and 27.4--57.6% for depression (all $p$s<0.05). TimeSRL significantly outperforms prior methods in cross-benchmark transfer across different sensing pipelines, rivaling its own within-domain performance without target-domain fine-tuning. These results demonstrate that semantic abstractions are reusable and point to a new direction for generalizable behavior modeling via RL-tuned LLMs. 2026-05-20T15:25:46Z Yuang Fan Lilin Xu Millie Wu Jingping Nie Qingyu Chen Yuzhe Yang Zhuo Zhang Xin Liu Subigya Nepal Xiaofan Jiang Xuhai "Orson" Xu http://arxiv.org/abs/2509.08010v2 Measuring and mitigating overreliance to build human-compatible AI 2026-05-20T11:59:51Z Large language models (LLMs) distinguish themselves from previous technologies by functioning as collaborative ``thought partners,'' capable of engaging more fluidly in natural language on a range of tasks. As LLMs increasingly influence consequential decisions across diverse domains from healthcare to personal advice, the risk of overreliance -- relying on LLMs beyond their capabilities -- grows. This paper argues that measuring and mitigating overreliance must become central to LLM research and deployment. 
First, we consolidate risks from overreliance at both the individual and societal levels, including high-stakes errors, governance challenges, and cognitive deskilling. Then, we explore LLM characteristics, system design features, and user cognitive biases that together raise serious and unique concerns about overreliance on LLMs in practice. We also examine historical approaches for measuring overreliance, identifying three important gaps and proposing three promising directions to improve measurement. Finally, we propose mitigation strategies that can be pursued to ensure LLMs augment rather than undermine human capabilities. 2025-09-08T16:15:07Z Lujain Ibrahim Katherine M. Collins Sunnie S. Y. Kim Anka Reuel Max Lamparth Kevin Feng Lama Ahmad Prajna Soni Alia El Kattan Merlin Stein Siddharth Swaroop Vishakh Padmakumar Ilia Sucholutsky Andrew Strait Diyi Yang Q. Vera Liao Umang Bhatt http://arxiv.org/abs/2605.21035v1 The Quiet Path from Seemingly Minor Design Errors to Workplace AI Incidents 2026-05-20T11:13:45Z Recent human-computer interaction (HCI) research has revealed a widespread misalignment between how developers design workplace artificial intelligence (AI) systems, and what workers actually need from them. Yet, little research has examined the effects of this gap, or how it may cause harm. We analyzed 1,524 reports of incidents in which AI systems were used to perform 171 occupational tasks across 12 industry sectors. Using a Large Language Model (LLM)-as-an-expert approach, we extracted the main traits of the AI systems involved in those incidents using an established framework of twelve traits. We then compared them with the traits that 202 workers highly familiar with those tasks would have preferred. We found that as many as 83\% of workplace incidents stem from worker-AI misalignments. In most cases, workers wanted systems that are precise, insightful, or personal, but instead received systems that are basic, simple, or general. 
Over the years, fast AI caused a considerable number of incidents, yet these declined, and imaginative AI, with the mass introduction of generative AI, started to cause incidents. We also compared the traits causing the incidents with the traits that 197 developers building AI systems for those tasks would have preferred. If the traits causing the incidents were the same as those designed by developers, then developers may be responsible for those incidents. We found that 74\% of task misalignments could be attributed to developers who tended to overfocus on efficiency and speed, especially for systems performing tasks in people-facing occupations such as those in the human resources sector. Our results call for design interventions that better align AI development with workers' needs, as without such corrections, workplace AI incidents are likely to persist, causing the invisible erosion of worker agency and organizational productivity. 2026-05-20T11:13:45Z Accepted in April 2026 to be published in the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25-28, 2026, Montreal, QC, Canada Julia De Miguel Velázquez Sanja Šćepanović Andrés Gvirtz Daniele Quercia 10.1145/3805689.3812396 http://arxiv.org/abs/2605.20941v1 PaintCopilot: Modeling Painting as Autonomous Artistic Continuation 2026-05-20T09:27:06Z We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. 
The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process. 2026-05-20T09:27:06Z Yunge Wen Yuancheng Shen Paul Pu Liang http://arxiv.org/abs/2605.20939v1 Toward 6G-enabled Brain Computer Interfaces: Technical Requirements, Use Cases, Challenges, and Future Trends 2026-05-20T09:26:40Z Brain computer interface (BCI) enables the brain to directly control an external device by converting neural signals into actionable outputs. However, effective real-time translation of brain activity strongly depends on the quality of neural communication between the brain and the external device. 6G is the next generation of wireless communication, expected to provide unprecedented levels of data rates, data security, and automation capabilities. In this context, integrating 6G into BCI systems would not only enhance the performance of brain-device communication, but would also create new opportunities for innovative applications. This work provides a comprehensive study on how BCI technology can be built effectively on top of 6G wireless networks by introducing several technical aspects and use cases. We first provide an overview of BCI and 6G, following their progression from early development to convergence through cognitive communication and advanced neural interfaces. 
We then highlight the need for the upcoming 6G systems toward BCI technology in every aspect, including 6G technologies such as intelligent edge and zero-touch networks, and 6G use cases such as digital twin, immersive communication, and internet of minds. Furthermore, we identify key technical challenges, open issues, and future research directions related to the 6G-enabled BCI paradigm. 2026-05-20T09:26:40Z Houda Hafi Bouziane Brik Nuraini Jamil Abdelkader Nasreddine Belkacem http://arxiv.org/abs/2509.07674v2 Temporal Counterfactual Explanations of Behaviour Tree Decisions 2026-05-20T06:21:06Z Explainability, in particular, the ability for robots to explain why they have made a decision or behaved in a certain way, is a critical tool in helping users understand the robots they interact and coexist with. Behaviour trees are a popular framework for controlling the decision-making of robots, and thus a natural question to ask is whether or not a system driven by a behaviour tree is capable of answering "why" questions. While explainability for behaviour tree-driven robots has seen some prior attention, no existing methods are capable of generating causal, counterfactual explanations which detail the reasons for robot decisions and behaviour. Therefore, in this work, we introduce a novel approach which automatically generates counterfactual explanations in response to contrastive "why" questions. Our method achieves this by first automatically building a causal model from the structure of the behaviour tree as well as domain knowledge about the state and individual behaviour tree nodes. The resultant causal model is then queried and searched to find a set of diverse counterfactual explanations. 
We demonstrate that our approach is able to correctly explain the behaviour of a wide range of behaviour tree structures and states in real time, unlike previous methods which are either unable to answer contrastive questions with causal explanations, or are not guaranteed to provide consistent and accurate explanations. By being able to answer a wide range of causal queries, our approach represents a step towards more transparent, understandable, and ultimately safe and trustworthy robotic systems. 2025-09-09T12:40:08Z 33 pages, 7 figures + 4 figures in appendices Tamlin Love Antonio Andriella Guillem Alenyà http://arxiv.org/abs/2605.20701v1 CandorMD: An AI-Assisted Audio Simulation and Feedback System for Training Clinicians for Medical Error Disclosure 2026-05-20T04:57:43Z Clinicians are expected to disclose harmful medical errors to patients and families in line with ethical, regulatory, and patient care standards, yet these conversations remain challenging because of their emotional complexity and limited training opportunities. Most physicians still learn primarily through lectures and observation, while static video tools (though available) are underused, lack adaptability across specialties, and deliver delayed, generic feedback. These gaps restrict skill development, reduce self-efficacy, and contribute to avoidance of disclosure conversations, ultimately compromising patient care and eroding trust. To address these needs, we designed CandorMD -- an AI-assisted simulation system that provides real-time practice, actionable feedback, and diverse practice environments tailored to individual learning needs. We conducted semi-structured interviews with physicians, risk managers, patient advocates, and communication experts to understand current practices, identify gaps, and collect feedback on CandorMD. Based on these insights, we present findings and design recommendations for the future of AI-supported medical communication training. 
2026-05-20T04:57:43Z Inna Wanyin Lin Sahand Sabour Hong Sng Maxine Chan Minlie Huang Andrew White Tim Althoff http://arxiv.org/abs/2509.01231v3 Unpacking "Personal" Health Informatics for Proactive Collective Care 2026-05-20T04:47:11Z Care is primarily a collective phenomenon, with a practice that involves sharing health and wellbeing information within a trusted "care circle" of family members and companions for sensemaking, interpretation, decision-making, and follow-through. However, current digital health tools and information systems are designed for individuals and primarily intended for Personal Health Informatics (PHI). This mismatch between collective practice and individualistic design creates new challenges for the proactive use of such systems in care settings and limits adoption, sustained engagement, and meaningful use. To examine how people practice collective care and how (if) they perceive, adopt, and integrate PHI systems for proactive care, we conducted a sequential mixed-methods study. Through an initial survey (n=87) and semi-structured interviews (n=22), we found that their practices involve collectively understanding, analyzing, and sensemaking health information. However, we also found that their use of existing systems to support such practices is constrained by factors at personal, relational, technological, and structural levels that evolve over time. To explore redesigning PHI toward "Collective Health Informatics", we conducted stakeholder-specific interviews (n=12), a follow-up survey (n=116), and co-design workshops (n=6) to understand the dynamics required for collective settings while retaining agency. Using a design probe evaluation (n=38), we refine a design vision for coordinated, trustworthy action across such care relationships. Our findings motivate CC-Proact, an operational map that translates ecological influences into three design levers: Agency, Elicitation, and Engagement. 
Using this map, our work empirically examines collective care practices and offers ten design recommendations for building responsible systems that proactively support collective care. 2025-09-01T08:19:06Z 29 pages, 5 figures, 5 tables; A qualitative HCI study with prototype evaluation; Preprint of an under-review manuscript Shyama Sastha Krishnamoorthy Srinivasan Mohan Kumar Pushpendra Singh