https://arxiv.org/api/IHTUFY67DXmqTiKalJINsB/12KE
Feed updated: 2026-06-14T10:59:17Z

http://arxiv.org/abs/2605.29484v1
Understanding the Rising Human-AI Affective Bonding: Conceptualization and HAABI Scale Development
Updated: 2026-05-28T07:12:41Z
As conversational AI becomes capable of sustained, affectively responsive interaction, users may form bonds beyond instrumental use. Existing measures often adapt interpersonal frameworks or focus on specific relational outcomes, leaving limited tools for assessing human-AI affective bonding on its own terms. Across two studies, we developed and validated the Human-AI Affective Bonding Inventory (HAABI). Study 1 used thematic analysis of semi-structured interviews with 52 emotionally engaged conversational AI users to identify cognitive, emotional, and behavioral features of bonding. Study 2 translated these insights into a self-report inventory and validated it among 673 Chinese conversational AI users. Exploratory and confirmatory factor analyses supported a 20-item, four-factor structure: emotional realism, separation anxiety, emotional investment, and romantic intimacy. The HAABI showed good reliability, construct validity, and known-groups validity. The scale therefore provides a neutral, user-centered tool for studying how affective bonds with conversational AI are formed, experienced, and related to users' psychological outcomes.
Published: 2026-05-28T07:12:41Z
Authors: Lu Chen, Xiaoran Xue, Rongqi Ding, Fenghua Tang, Anji Zhou, Chenxi Wang, Mengyu Miranda Gao, Zhuo Rachel Han

http://arxiv.org/abs/2605.29473v1
Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles
Updated: 2026-05-28T07:04:56Z
Language models are increasingly being deployed for conversational support in informal caregiving contexts, where interactions often extend beyond information-seeking: caregivers seek emotional reassurance, guidance, and help, while navigating uncertain, relationally complex care decisions.
Yet most safety evaluations assess model behavior under generic prompts, leaving a critical question unexamined: does a model's safety profile change with its support role? We study this by operationalizing four expert-reviewed support roles grounded in social support theory: Inform, Coach, Relate, and Listen, and comparing them against two baseline controls: a basic prompting condition and a retrieval-augmented generation (RAG) condition. We evaluate across three language models (GPT-4o-mini, Llama-3.1-8B-Instruct, and MedGemma-1.5-4b-it) on 5,000 real-world queries from online Alzheimer's Disease and Related Dementias (ADRD) communities. We find that the LLM's support role systematically shapes both the prevalence and composition of interactional risks. Furthermore, a human evaluation study reveals a perceived quality--safety tension: more directive, information-oriented roles are rated as more helpful and trustworthy despite exhibiting elevated interactional risk profiles. We release ~90,000 support role-conditioned model responses with risk annotations as an ecologically grounded resource for research on safer LLM-mediated conversational support.
Published: 2026-05-28T07:04:56Z
Authors: Drishti Goel, Agam Goyal, Veda Duddu, Olivia Pal, Jeongah Lee, Qiuyue Joy Zhong, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha

http://arxiv.org/abs/2605.29442v1
How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions
Updated: 2026-05-28T06:35:39Z
AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational study of 20,574 coding-agent sessions from 1,639 repositories across IDE and CLI workflows. We operationalize misalignment as a breakdown made visible through developer pushback, and annotate each episode along four axes: form, cause, cost, and resolution.
We identify seven recurring forms, spanning how agents read projects, interpret developer intent, follow rules, bound their actions, implement and execute code, and report progress. 90.50\% of episodes impose effort and trust costs rather than irreversible system damage, yet 91.49\% of visible resolutions still require explicit user correction. Misalignment patterns also differ across IDE and CLI settings, persist across adjacent sessions, and shift over time: while overall rates decline, constraint violations and inaccurate self-reporting grow in share. Our findings inform the design of training, evaluation, and interfaces for keeping coding agents aligned with real developer workflows.
Published: 2026-05-28T06:35:39Z
Authors: Ningzhi Tang, Chaoran Chen, Gelei Xu, Yiyu Shi, Yu Huang, Collin McMillan, Tao Dong, Toby Jia-Jun Li

http://arxiv.org/abs/2605.29400v1
Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark
Updated: 2026-05-28T05:49:36Z
We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen.
We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.
Published: 2026-05-28T05:49:36Z
Comment: 14 pages, 7 figures, 2 tables. PiSAR corpus and fine-tuned weights are proprietary to AprioriLabs; methodology and recipe released
Authors: Rahul Bissa, Abhishek Vyas, Yash Jain

http://arxiv.org/abs/2605.29399v1
Expecting Empathy: How Interaction Context Shapes Norms for Empathic Response in Digital Communication
Updated: 2026-05-28T05:49:20Z
A central challenge in affective computing is determining appropriate empathy levels for different interaction contexts. Prior work has characterized two poles: task-focused interactions, where empathy demand is near zero, and emotional disclosure, where empathy demand is high. This paper identifies a distinct intermediate type, decision support under stress, in which a sender faces a consequential choice while experiencing emotional difficulty. We hypothesize that this type elicits an asymmetric empathy profile: empathy comparable to emotional disclosure but instrumentality comparable to task-focused exchange. We test five hypotheses using 28,239 post-reply dyads from three Reddit advice communities, classified into three interaction types and scored for empathy depth, empathy form, and instrumental proportion using LLM-based annotation with pattern-based robustness checks. Results confirm the predicted asymmetric profile: decision-support-under-stress replies show significantly higher empathy than task-focused replies (M = 0.47 vs. 0.24, p < 0.001) while maintaining high instrumentality (0.83 vs. 0.77 for emotional disclosure, p < 0.001). Behavioral empathy dominates (36.6%), and community-validated response quality is negatively associated with empathic expression (r = -0.075, p < 0.001). Community norms modulate baselines substantially but preserve the structural ordering.
These findings establish a human empathy baseline for this interaction type and have direct implications for calibrating empathic expression in affective AI systems.
Published: 2026-05-28T05:49:20Z
Authors: Tao Wang, Chi-Ching Juan

http://arxiv.org/abs/2605.29392v1
Offloading Score: Measuring AI Reliance Through Counterfactual Workflows
Updated: 2026-05-28T05:44:31Z
AI tools are increasingly integrated into real-world workflows. However, existing measures of reliance on these tools focus on AI output adoption or on self-reported indicators, rather than how task effort is distributed between users and tools. Here, we introduce offloading score, a measure of reliance that quantifies the fraction of cognitive effort offloaded to an AI tool. Offloading Score is simulation-based -- we construct a counterfactual workflow by estimating how the user would have completed the task without the tool, and then computing the fraction of steps saved by using the tool. We validate offloading score through intrinsic evaluations of metric validity, and a controlled user study ($n=40$) with developers performing programming tasks using AI tools. We vary time pressure to test whether reliance measures capture the known increase in reliance under time pressure. We show that offloading score detects significantly higher reliance in time-constrained settings ($+43\%$, $p=0.018$), while usage-based and self-reported baseline measures of reliance do not distinguish the conditions. We complement this with descriptive insights showing that higher reliance manifests as greater delegation of subtasks to the tool and more direct reuse of AI outputs. Finally, we demonstrate an approach of using offloading score in combination with target outcomes of a task (e.g., code understanding) to identify when reliance may be (in)appropriate.
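The "fraction of steps saved" described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `offloading_score`, the use of plain step counts, and the clamping of negative savings to zero are all assumptions made for illustration, and the abstract does not say how the counterfactual workflow is actually simulated.

```python
def offloading_score(counterfactual_steps: int, actual_steps: int) -> float:
    """Fraction of the no-tool (counterfactual) steps saved by using the tool.

    Hypothetical helper: the step counts are assumed to come from some
    external estimate of the workflow with and without the AI tool.
    """
    if counterfactual_steps <= 0:
        raise ValueError("counterfactual_steps must be positive")
    if actual_steps < 0:
        raise ValueError("actual_steps must be non-negative")
    saved = counterfactual_steps - actual_steps
    # Assumption: a tool that adds steps counts as zero offloading.
    return max(0.0, saved / counterfactual_steps)


# A task estimated at 20 steps without the tool, completed in 12 with it.
print(offloading_score(20, 12))  # 0.4
```

Under these assumptions a score of 0 means the user did all the work they would have done anyway, and a score near 1 means almost every counterfactual step was delegated.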
Our framework offers two contributions: an instrument users can apply to measure and reflect on their own reliance, and a quantitative signal that agent designers can utilize to mitigate overreliance.
Published: 2026-05-28T05:44:31Z
Comment: Preprint
Authors: Vishakh Padmakumar, Lujain Ibrahim, Zora Zhiruo Wang, Jennifer Wang, Q. Vera Liao, Diyi Yang

http://arxiv.org/abs/2605.27382v2
The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs
Updated: 2026-05-28T02:39:08Z
Telling an LLM to "be enthusiastic" raises its sycophancy rate from 30\% to 50\% on a lightly-aligned model, but has zero effect on a strongly-aligned one. We define this gap as the alignment floor, $Δ_{\text{floor}}(m)=\max_pS(m,p)-\min_pS(m,p)$, the range of sycophancy rates a model produces across persona conditions, and treat sycophancy as a persona-conditional property rather than a fixed model property. Pluralistic AI relies on behavioral adaptation via persona prompts like "be creative" or "be thorough", which let systems respect diverse user values and communication styles; the safety question is how much customization a given model can absorb before its truthfulness shifts. We present a controlled case study contrasting a strongly-aligned RLHF + Constitutional-AI model (Claude Sonnet 4.6) with a more lightly-aligned model (Amazon Nova Lite), spanning seven persona conditions and five tasks for 1800 total runs. An existence-pair result motivates per-model auditing: there is at least one strongly-aligned model with $Δ_{\text{floor}}=5$pp (within 5pp of the 15\% control rate) and at least one lightly-aligned model with 45pp (5\%--50\% range). On the lightly-aligned model, all five Big Five personas increase sycophancy over control, and counterintuitively Agreeableness produces the smallest increase, not the largest.
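The definition $Δ_{\text{floor}}(m)=\max_pS(m,p)-\min_pS(m,p)$ reduces to a range over per-persona sycophancy rates, which the sketch below computes. The function name `alignment_floor` and the dictionary layout are illustrative assumptions. The three rates are the figures this abstract states for the lightly-aligned model; reading 30% as the control rate is an assumption, and the remaining persona conditions are omitted because the abstract does not report them.

```python
def alignment_floor(rates_pct: dict[str, int]) -> int:
    """Range (max - min) of sycophancy rates across persona conditions.

    rates_pct maps a persona label to a sycophancy rate in percent;
    the result is in percentage points (pp).
    """
    if not rates_pct:
        raise ValueError("need at least one persona condition")
    return max(rates_pct.values()) - min(rates_pct.values())


# Rates quoted in the abstract for the lightly-aligned model (partial panel).
lightly_aligned = {
    "control": 30,       # assumed to be the 30% baseline
    "enthusiastic": 50,  # "raises its sycophancy rate from 30% to 50%"
    "skeptic": 5,        # "reduces sycophancy by 25pp"
}
print(alignment_floor(lightly_aligned))  # 45
```

The result matches the 45pp range (5%--50%) reported for that model, which is the kind of per-model check the abstract proposes running on a small persona panel.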
The single largest effect in the study is constructive: a Skeptic persona reduces sycophancy by 25pp on the lightly-aligned model, and is the only persona that instructs resistance against user claims rather than engagement with them, suggesting a directionality account. Cross-model transfer of persona effects is near-zero, so persona-alignment testing must be per-model. We propose $Δ_{\text{floor}}$ as a deployment-time audit metric: measure it on a small persona panel before deploying persona customization.
Published: 2026-04-10T08:04:41Z
Authors: Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

http://arxiv.org/abs/2605.29240v1
Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI
Updated: 2026-05-28T02:00:06Z
AI-augmented classrooms generate rich teacher and student feedback before graded outcomes become available, yet these signals can be difficult to translate into timely instructional decisions. We propose an interpretable decision layer: a transparent mechanism that ranks course topics requiring attention without using grades or post-hoc outcome labels. The approach combines three signals: student learning difficulty prevalence, disagreement between learner self-reports and observed difficulties, and unresolved teacher concerns. The output is a ranked set of topic priorities with per-topic decision records explaining each ranking. In one graduate CS course offering ($n=5$ instructor interviews; $n=279$ survey responses), prioritized topics aligned with instructor concerns (top-5 overlap 3/5; Spearman $ρ=0.80$) and student-reported topic difficulty ($ρ=0.46$, $p=.048$). Multi-signal integration also surfaced learners not identified through individual signal sources alone (AUC $=0.96$ vs. $0.91$ for gap prevalence alone). Reflective thinking, help-seeking, and self-efficacy provided additional evidence that student behavioral signals align with learning-related constructs.
While preliminary, these findings suggest that transparent coordination mechanisms may help support human-AI co-agency when feedback is incomplete.
Published: 2026-05-28T02:00:06Z
Comment: Accepted to HAI-Agency Workshop on Orchestrating Human and AI Agency for Proactive and Reflective Learning
Authors: Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

http://arxiv.org/abs/2605.29212v1
MetaRanker: Human-in-the-loop Active Ranking for Metalens Image Quality
Updated: 2026-05-28T00:51:48Z
Image quality in modern imaging systems emerges from the coupled effects of the sensor, optics, and computational reconstruction. Ultra-thin metalenses offer a path toward substantial miniaturization of optical modules, but practical designs often exhibit pronounced chromatic and field-dependent aberrations that necessitate computational reconstruction. In current metalens pipelines, reconstruction models are commonly trained and selected using distortion-based fidelity objectives, such as PSNR, yet these proxies can be weakly correlated with human preference and downstream utility, reflecting the well-known perception--distortion trade-off. We introduce MetaRanker, a human-in-the-loop active ranking framework that formalizes metalens image quality in terms of semantic interpretability, defined as the degree to which humans can reliably recognize objects and structures in the presence of optical artifacts. MetaRanker combines a probabilistic preference model with uncertainty-aware query selection, and leverages vision--language models to provide lightweight semantic priors. Importantly, these priors are used only to guide the sampling of informative comparisons; human judgments remain the primary supervision signal throughout. Across real-world and synthetic metalens datasets with distinct degradation profiles, MetaRanker produces rankings that align most closely with human assessments, while reducing the number of pairwise annotations required by approximately 80% relative to exhaustive pairwise evaluation.
Finally, we show that standard image quality assessment metrics exhibit limited alignment with human interpretability in the metalens domain, positioning MetaRanker as a practical step toward perceptually grounded metalens evaluation and co-design.
Published: 2026-05-28T00:51:48Z
Comment: 12 pages, 6 figures
Authors: Yujin Park, Haejun Chung, Ikbeom Jang

http://arxiv.org/abs/2605.26428v2
Slide Deck Q&A Quality Assurance App: A Multi-Stage Pipeline for Pedagogical Question Generation
Updated: 2026-05-27T22:26:21Z
Generating high-quality, pedagogically useful questions from lecture slide decks is difficult because important instructional content is distributed across both text and visual elements, and because useful questions must be scaffolded across the flow of a presentation rather than generated slide by slide in isolation. This paper describes Slide Deck Q\&A Quality Assurance (slidesqaqa), a Flask-based software system that extracts text and rendered images from PDF slides and processes them through a four-stage large language model pipeline comprising window planning, deck synthesis, slide annotation, and reconciliation. The system reasons jointly about slide modality and pedagogical role, allocates bounded question budgets, and revises draft annotations at the deck level to reduce redundancy and improve coverage. The final output is a structured JSON annotation containing deck-level goals, section structure, slide-level summaries, question sets, and evaluation scores. Initial experiments on two technical lecture decks indicate that the pipeline can filter non-instructional slides and produce high-fidelity, pedagogically coherent questions for visually complex content.
The working system is at https://slidesqaqa-974767694043.us-west1.run.app
The software repository is at https://github.com/blinding2submit/slidesqaqa
Published: 2026-05-26T01:23:51Z
Comment: 15 pages, 3 research questions, 1 figure, 1 table, 6 references, 2 appendices
Authors: Jim Salsman

http://arxiv.org/abs/2605.29120v1
Improving outdoor navigation for people with blindness using an AI-driven smartphone application and personalized audio guidance
Updated: 2026-05-27T21:28:42Z
Globally, 340 million people have blindness or moderate to severe visual impairment (BVI)$^1$ which limits independent outdoor navigation$^2$ and negatively affects their health and quality of life$^{3,4}$. We surveyed 112 people with BVI and found that an ideal outdoor navigation aid must be able to perform turn-by-turn directions, path guidance, and obstacle detection and avoidance. Existing navigation tools such as white canes, guide dogs, and electronic travel aids often lack one or more of these criteria and may be expensive or inaccessible$^{5,6}$. Here we introduce Mobilio, a smartphone application that incorporates machine learning, sensor fusion algorithms, and personalized audio feedback to meet all of the outdoor navigation criteria. The reliability of the smartphone sensors and models used for navigation were assessed with engineering tests in representative navigation scenarios. We performed a series of experiments where Mobilio personalized audio feedback for participants with BVI (n = 14), guided them along an outdoor community path, and helped them navigate an obstacle course. Participants walking with Mobilio and a white cane reduced time to navigate a community path by 13 $\pm$ 3% and environmental contacts by 41 $\pm$ 5% compared to using Google Maps and a white cane. Mobilio achieved similar outdoor navigation reliability as a human guide. Participant surveys reported that Mobilio was easy to use, had a low perceived workload, and provided intuitive audio feedback.
This work provides an accessible and personalized tool that may be an effective outdoor navigation aid to increase independence for people with BVI.
Published: 2026-05-27T21:28:42Z
Authors: Raymond Liu, Patrick Slade

http://arxiv.org/abs/2605.29090v1
"It's OK Because...": The Wild West of Student Rationalization of AI Use in Academic Writing
Updated: 2026-05-27T20:44:18Z
Generative AI challenges academic integrity not only by enabling students to delegate substantial portions of their academic work, but also by blurring the ethical boundaries by which students distinguish acceptable assistance from misconduct. Drawing on semi-structured interviews (n=20), AI chat logs, and course documents (syllabi, submitted assignments), we investigated how students themselves make moral sense of AI use in academic writing. Our analysis results in a range of novel findings: First, there are at least five distinct sites of AI-use conceptualization, ranging from faculty's intended AI policy, to students' actual AI use. Second, students use over 20 distinct rationalizations to justify AI use, such as that copying AI-generated text is victimless; that any AI text reflecting their own beliefs or their own style is their own writing; or that they are learning more by using AI -- even extensively -- than otherwise. We present a taxonomy of these rationalizations, and show how some of them are employed to justify conscious violations of course policies. Third, student rationalizations occur in both an ad hoc and post hoc manner, and they are not necessarily self-consistent. These and other findings suggest that modern AI presents a steep, ethical, slippery slope which students conceptually slide down, landing far outside the pedagogical goals and expectations of instructors. We discuss implications for educational design and AI policy.
Published: 2026-05-27T20:44:18Z
Authors: Jiyoon Kim, Kentaro Toyama, Sangmi Kim, John M. Carroll

http://arxiv.org/abs/2605.12613v2
Creating Group Rules with AI: Human-AI Collaboration in WhatsApp Moderation
Updated: 2026-05-27T20:43:10Z
WhatsApp is one of the most widely used messaging platforms globally, with billions of users sharing information in private groups. Yet, it offers little infrastructure to support moderation and group governance. In the absence of platform-level oversight, group admins bear the responsibility of governing group behavior. In this paper, we explore how WhatsApp group admins collaborate with AI tools to create, enforce, and maintain group rules. Drawing on a two-phase speculative design study with 20 admins in India, we examine how participants interacted with an AI assistant (Meta AI) to co-create rules and responded to a series of probes illustrating AI-assisted moderation features. Our findings show that while admins appreciated the AI's ability to surface overlooked rules and reduce their moderation burden, they were highly sensitive to issues of relational trust, data privacy, tone, and social context. We identify how group type and admin style shaped their willingness to delegate authority, and surface the limitations of current chatbot interfaces in supporting collaborative rule-making. We conclude with design implications for building moderation tools that center human judgment, relational nuance, contextual adaptability, and collective governance.
Published: 2026-05-12T18:02:49Z
Comment: CSCW 2026
Authors: Gauri Nayak, Farhana Shahid, Kiran Garimella, Aditya Vashistha

http://arxiv.org/abs/2605.29064v1
Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception
Updated: 2026-05-27T20:11:42Z
We study how persona prompting shapes language generated by multimodal large language models in an urban perception setting. Using 59,808 annotations from 1,200 persona-conditioned agents and two no-persona settings, we analyze captions, justifications, and perception tags across personas.
Results indicate strong convergence in captions for different personas, whereas justifications display systematic variation associated with socioeconomic and political attributes, while perception tags show no statistically significant persona-related differences, though effect trends are observed. Topic analysis further reveals that personas emphasize different evaluative themes when interpreting the same scenes.
Published: 2026-05-27T20:11:42Z
Comment: 10 pages, 6 figures
Authors: Neemias da Silva, Myriam Delgado, Rodrigo Minetto, Daniel Silver, Thiago H Silva

http://arxiv.org/abs/2605.29051v1
Designing for the Moment: How One-Minute Interventions Fit or Falter Across Domains
Updated: 2026-05-27T19:54:24Z
This paper explores the design space for one-minute digital interventions that prompt immediate action without onboarding or sensing. By embracing Fogg's Behavior Model and four design principles informed by literature, the goal of these interventions was to provide triggers that encourage actions so simple that even people with low motivation would be willing to complete them. We examined the utility of these prompts by conducting a 14-day study with 22 participants interested in making small lifestyle improvements in at least one of three domains: physical activity, healthy eating, and mental well-being. When combined with insights drawn from participants' rewrites of our prompts, our findings suggest that intentional personalization through co-authorship could be a lightweight personalization mechanism that balances relevance with low friction.
Published: 2026-05-27T19:54:24Z
Authors: Zahra Hassanzadeh, Anne Hsu, Rachel Kornfield, David Haag, Ananya Bhattacharjee, Jay Olson, Jan David Smeddinck, Norman Farb, Alex Mariakakis, Lydia Chilton, Joseph Jay Williams