https://arxiv.org/api/vpynqWVGnnplQf6wexDBQDIV4pE
2026-06-14T03:24:45Z
3093424015

http://arxiv.org/abs/2606.02375v1
WAXAL-NET: Finetuned Edge ASR Across 19 African Languages
2026-06-01T15:22:35Z
We evaluate whether compact domain-specialized ASR models can outperform massively multilingual foundation models for conversational African speech across 19 languages in the WAXAL corpus. Fine-tuned edge models achieve a macro-averaged WER of $38.0\%$ compared to $64.9\%$ for the best zero-shot baseline, a $26.9$ percentage-point reduction using models $3-40\times$ smaller. Results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution (OOD) speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution. A distributed native-speaker audit across all surveyed languages produces a linguistically-grounded error taxonomy, showing that CTC and autoregressive architectures behave differently across language families. We further show that WER alone misrepresents performance for syllabary-script languages where CER/WER ratios reveal substantially higher character-level accuracy than headline WER suggests. Finally, to contribute to future African ASR research, we release all model weights, fine-tuning and evaluation scripts, and a cleaned WAXAL subset covering all $19$ languages.
2026-06-01T15:22:35Z
Victor Tolulope Olufemi, Oreoluwa Babatunde, Ramsey Njema, Bolarinwa Gbotemi, Wanchi Lucia Yen, John Uzodinma, Sunday Ajayi, Oluwademilade Williams, Kausar Moshood, Innocent Elendu Anyaele, Akebert Arefaine, Candace Hunzwi, Wongel Dawit Daniel, Emmilly Namuganga, Cleophas Kadima, Athanase Bahizire, Onitsiky Ranaivoson, Emmanuel Aaron, Nicholaus Ladislaus, Idris Muhammed, Jonathan Enoch Simenya, Martin Koome, Matewos Tegete Endaylalu, Peter Ifeoluwa Adeyemo, Hondi Prisca Birindwa, Ukachi Agnes Eze-Mbey, Yacoba Oduro-Yeboah, Pericles Adjovi, Mikel K. Ngueajio, Toluwani Aremu, Prasenjit Mitra

http://arxiv.org/abs/2606.02305v1
Mapping Whisper Representations to Human ECoG Responses with Interpretable Time-Resolved Neural Encoding
2026-06-01T14:25:36Z
Understanding how speech foundation models relate to human cortical activity is a key challenge for computational neuroscience. Here, we investigate how internal representations from Whisper predict intracranial ECoG responses during naturalistic speech perception. We introduce a time-resolved neural encoder that combines speech embeddings with a recurrent temporal model and soft attention, allowing us to examine layer-wise brain alignment. Intermediate Whisper layers provide the strongest correspondence with neural activity, supporting a hierarchical match between model representations and cortical speech processing. Comparisons with baselines show that high-resolution ECoG responses benefit from temporally structured modelling beyond linear mappings from the same speech representations. In addition, attention maps reveal temporally local alignment between speech embeddings and neural responses, while a phonemic interpretability analysis identifies anatomically coherent phoneme-category organization among encoding-informative electrodes. Together, these results suggest that speech foundation models offer a useful framework for studying time-resolved cortical speech representations.
2026-06-01T14:25:36Z
Presented at ICLR 2026 Workshop on Representational Alignment (Re-Align)
Matteo Ciferri, Tommaso Boccato, Michal Olak, Matteo Ferrante, Nicola Toschi

http://arxiv.org/abs/2606.02301v1
Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video
2026-06-01T14:23:17Z
Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challenging in real-world settings.
While optical motion capture provides high precision for assessing altered movement quality, it is costly and restricted to laboratory environments. We aimed to develop and validate Quantitative Movement Testing (QMT), a computer vision pipeline extracting 3D kinematic biomarkers from standard monocular smartphone video, balancing clinical accessibility with biomechanical accuracy. We validated the QMT pipeline, utilising deep learning-based 3D pose-estimation, against gold-standard optical motion capture in healthy controls (N=13). Following leave-one-subject-out calibration to correct systematic bias, we deployed QMT in two prospective clinical cohorts to assess real-world utility: a pre- and post-intervention trial for fibromyalgia patients, and a 30-day longitudinal at-home monitoring study of chronic sciatica patients and healthy controls. In laboratory validation, QMT extracted clinical kinematic metrics with high agreement to optical motion capture, yielding strong correlations (r > 0.85) and low mean absolute errors. QMT demonstrated high test-retest reliability (r > 0.86) in fibromyalgia patients and successfully tracked day-to-day movement fluctuations in chronic sciatica. While real-world home settings introduced higher measurement variance than lab settings, QMT found group-level differences between healthy controls and sciatica patients based entirely on remote recordings. Monocular 3D pose estimation offers a scalable alternative to traditional assessments. 
QMT provides an objective, accessible biomarker for tracking disease progression and treatment response in clinical trials, though further research is needed to optimise reliability in home environments.
2026-06-01T14:23:17Z
Pranav Mahajan, Amanda Wall, Eleonora Maria Camerone, Julie Stebbins, Eoin Kelleher, Shuangyi Tong, Annina Schmid, Katja Wiech, Anushka Irani, Ben Seymour

http://arxiv.org/abs/2606.02260v1
Guided Sensemaking: Agents in Collaborative Deliberation
2026-06-01T13:48:24Z
Generative AI systems are aggressively reshaping how students engage with information and perform cognitive work; convenience-oriented use has the potential to displace effortful reasoning, reflection, and learning, especially for those who lack domain expertise and effective human-AI interaction strategies. Current AI tools are heavily focused on chat-style interfaces geared towards answer generation and efficiency in a linear and fragmented stream of text, offering limited support for structured reflection, argument construction, and sensemaking in collaborative contexts. We introduce Guided Sensemaking, an AI-augmented multiagent discourse platform that facilitates composition of well-thought-out ideas around a central question, provides scaffolding for critical thinking, and enables visualization of argumentative structure to support critical thinking and collaborative deliberation. The system uses several interactive agents to provide context-sensitive questioning prompts and a scaffolding for thought that exposes thematic clusters, agreements, and points of contention without collapsing diverse perspectives.
This paper proposes a conceptual design and interaction paradigm that positions generative AI not as a shortcut to answers but as a research partner that externalizes reasoning, preserves user agency, and fosters structured, traceable sensemaking in educational and civic contexts.
2026-06-01T13:48:24Z
Presented at Tools for Thought (TfT) workshop at CHI 2026
Aaditya Bhatia, Navdeep Kaur Bhatia, Marc-Antoine Parent, Jack Park

http://arxiv.org/abs/2603.22235v3
ShapDBM: Exploring Decision Boundary Maps in Shapley Space
2026-06-01T13:15:08Z
Decision Boundary Maps (DBMs) are an effective tool for visualising machine learning classification boundaries. Yet, DBM quality strongly depends on the dimensionality reduction (DR) technique and high dimensional space used for the data points. For complex ML data, DR can create many mixed classes which yield DBMs that are hard to use or even misleading. We propose a new technique to compute DBMs by transforming data space into Shapley space and computing DR on it. Compared to DBMs computed directly from data, our maps have similar or higher quality metric values and visibly more compact, easier to explore, decision zones that better agree with measured model performance.
2026-03-23T17:31:20Z
4 pages and 3 figures (excluding supplementary material)
Luke Watkin, Daniel Archambault, Alex Telea

http://arxiv.org/abs/2606.02208v1
Context-Aware Workflow Decomposition for Automated Mobile UI Annotation Using Multimodal Large Language Models
2026-06-01T13:08:07Z
Accurate mobile user interface annotation is important for UI understanding, accessibility tools, automated testing, dataset construction, and GUI agents. However, mobile screens are difficult to annotate because they often contain small, dense, nested, and visually ambiguous elements. Multimodal large language models can help automate this process, but their outputs are sensitive to prompt design and the organization of annotation tasks.
This paper studies automated mobile UI annotation from a workflow design perspective, focusing on improving annotation precision. Rather than asking the model to annotate all UI elements in a single step, the task is divided into smaller context-aware stages, allowing related UI elements to be handled with clearer instructions and useful screen context. The proposed pipeline uses structured prompts, schema-constrained JSON outputs, and element-specific annotation instructions. Experiments are conducted on expert-annotated mobile UI screens from the MUIAnno dataset, using eight common UI element types: button, tab, clickable text, card, label, plain text, icon, and image. Four workflow strategies are evaluated: one-step, two-step, four-step, and eight-step annotation. Results show that the two-step workflow achieves the highest precision, while deeper decomposition improves recall but produces more false positives. Additional grouping experiments show that annotation quality depends on both workflow depth and element-class grouping. Overall, careful workflow design can make LLM-based mobile UI annotation more reliable for UI understanding, dataset construction, and GUI agent development.
2026-06-01T13:08:07Z
Athar Parvez, Muhammad Jawad Mufti, Muqaddas Gull, Omar Hammad

http://arxiv.org/abs/2606.02082v1
Overview of the ClinicalSkillQA 2026 Shared Task on Continuous Perception and Procedural Reasoning in Clinical Skill Assessment
2026-06-01T11:12:28Z
This paper presents an overview of the ClinicalSkillQA 2026 shared task, which was organized with the BioNLP Workshop at ACL 2026. The goal of this shared task is to evaluate continuous perception and procedural reasoning in clinical skill assessment by requiring systems to reconstruct the correct temporal order of shuffled clinical key frames and generate rationales grounded in clinical workflow knowledge. The benchmark contains 200 test-only instances sampled from clinical skill videos, covering three emergency-care procedures.
Each instance is annotated with the ground-truth temporal order and an expert-verified rationale. A total of seven teams participated in the task, collectively making 90 submissions, with four teams providing system description papers. Systems are evaluated using Task Accuracy, Pairwise Accuracy, and BERTScore, which measure exact sequence reconstruction, local temporal consistency, and rationale quality, respectively. In this paper, we describe the task setup, dataset construction, and evaluation criteria. We further summarize the methodologies adopted by participating teams and present a comprehensive analysis of the submitted systems. The official results suggest that current models still struggle with continuous perception and procedural reasoning, especially when they must integrate visual evidence, temporal structure, and clinical workflow knowledge.
2026-06-01T11:12:28Z
Xiyang Huang, Renxiong Wei, Yihuai Xu, Zhiyuan Chen, Keying Wu, Jiayi Xiang, Buzhou Tang, Yanqing Ye, Jinyu Chen, Cheng Zeng, Min Peng, Qianqian Xie, Sophia Ananiadou

http://arxiv.org/abs/2606.02668v1
What You Approve Is What Executes: Consent Integrity for Black-Box LLM Agents
2026-06-01T11:08:17Z
Coding agents gate consequential actions behind a human-in-the-loop approval dialog, but the dialog is narrated by the agent itself: the human approves a summary the agent writes. The Lies-in-the-Loop (LITL) attack shows that summary is forgeable, so a compromised agent can show a benign description while a different action runs. This paper names the missing property, Consent Integrity, by importing What You See Is What You Sign (WYSIWYS) and the trusted-path property into the agent approval channel: the action shown to the human must be rendered by a trusted mediator from the real action at the boundary, not the agent's narration, over a path the agent cannot spoof, and bound to the exact action that executes.
Two twists distinguish it from classical WYSIWYS: the renderer is the adversary, and the boundary ground truth is a low-level event that must be decoded without trusting the agent. Since no decoder is complete, the realizable target is analyzer-relative: whatever the analyzer cannot classify is surfaced as uninspectable rather than silently approved. A prototype implements the analyzer, renderer, and bind-to-execution; total mediation and the trusted path are specified but assumed, not implemented. On GTFOBins, an independent corpus of 1330 trusted-tool abuses, the prototype silently passes 10.0% (every instance through a trusted tool); on tldr, 28,798 normal-usage commands, it marks 87.0% uninspectable. These two independent measurements bracket the design's central tension: the trust list that bounds silent passes is the same one that drives over-prompting, and a boundary-only mediator can move along that frontier but not escape it. The contribution is the property, the mechanism, and an honest position on that frontier, not a solved defense.
2026-06-01T11:08:17Z
Preprint. IEEE conference format. Proof-of-concept; artifact at https://github.com/zjnbwxq/agentguard-ci
Xiaoqi Weng

http://arxiv.org/abs/2606.02037v1
Respectful Things: Adding Social Intelligence to 'Smart' Devices
2026-06-01T10:26:21Z
In this paper, we propose that the idea of devices respecting their end-users may serve as a strong design goal for highly personal and intimate smart devices. We ask what respect is, how it shapes interaction, and how good-faith simulation of respect might inform user-friendly smart device design. Respect is a natural and integral part of natural human relationships that is seen to shape work and personal relations. In a basic sense, this is the core purpose of smart things: we expect them to be ready and willing to help us.
In this vein, we distil the characteristics of more complex respectful behaviours into 4 main types relevant to smart devices, drawing from philosophical analyses of the conceptual dimensions of respect: directive respect, obstacle respect, recognition respect, and care respect. We discuss the implications of each of these kinds of respect for the future of smart personal devices.
2026-06-01T10:26:21Z
In Proceedings of the 2018 Living in the Internet of Things: Cybersecurity of the IoT Conference
Max Van Kleek, William Seymour, Reuben Binns, Nigel Shadbolt
10.1049/cp.2018.0006

http://arxiv.org/abs/2606.01976v1
AutoBG: A Board Game Design Assistant with Interactive Ideation, Iterative Rulebook Generation, and Individualized Feedback
2026-06-01T09:37:51Z
Designing a board game demands both thinking as a designer and experiencing as a player, while iterating through repeated prototyping and playtesting cycles, making it a cognitively intensive creative task well suited for human-AI collaboration. However, current systems lack end-to-end support to guide designers through the complete workflow from vague early ideation to iterative rulebook revision and audience testing. To this end, we present AutoBG, a board game design assistant built around critic-driven iterative refinement, comprising four specialized modules: BG-Ideator guides designers via multi-turn dialogue to produce structured design drafts; BG-Realizer generates complete rulebooks from drafts and revises them in a closed loop with BG-Critic, which diagnoses design flaws and gates each revision so that only verified improvements are accepted; and BG-Persona simulates individualized feedback from 150 real player profiles. Together, these modules enable designers to go from an initial idea to a polished, audience-tested rulebook within a single integrated workflow. The system is built on 2.2K structured rulebooks and 180K quality-filtered real player reviews, with task-specific training data derived for each module.
Experiments on 207 held-out games show that AutoBG substantially outperforms state-of-the-art baselines (e.g., GPT-5.4), generating rulebooks that approach the quality of published games. Furthermore, a user study with 30 participants across diverse experience levels confirms that AutoBG effectively reduces blank-page anxiety, surfaces hidden design flaws, and provides highly rated, practical assistance throughout the creative process.
2026-06-01T09:37:51Z
Zizhen Li, Chuanhao Li, Yibin Wang, Jianwen Sun, Yukang Feng, Fanrui Zhang, Mingzhu Sun, Yifei Huang, Kaipeng Zhang

http://arxiv.org/abs/2606.01969v1
Trust-Calibrated Code Review: A Participatory Design Study of Review Workflows for LLM-Generated Multi-File Changes
2026-06-01T09:32:25Z
Background: Developers increasingly review multi-file code changes generated by LLM-based agents, yet no validated end-to-end workflow or IDE tooling design exists for this scenario.
Aims: We investigate (RQ1) the challenges developers face when reviewing LLM-generated multi-file changes and (RQ2) how developers envision effective workflows for this task.
Method: In collaboration with JetBrains, we conducted a participatory design study structured using the double-diamond design process with Discover, Define, Develop, and Deliver phases. Industry practitioners participated in the Discover phase (N=17); seven of these returned for the Develop phase. The Define phase was an author-led synthesis. The Deliver phase produced a conceptual design and a high-fidelity semi-interactive prototype evaluated through a follow-up survey with N=43 practitioners.
Results: Participants identified trust-calibration as the central challenge. The study yielded a three-level review workflow (overview, file-analysis, code snippet review) supported by seven design constructs (chunk, risk-per-line, risk-per-file, judge, walk-through, zooming in/out, and security cage). In the validation survey, all three workflow levels scored above the neutral midpoint (means 3.50--3.91 on a five-point scale). Of the respondents, 63% expected reduced overall review effort, and 52% reduced trust-assessment effort, relative to their current tools. These findings suggest that the design constructs indicate a positive direction for future tool development.
Conclusions: Reviewing LLM-generated multi-file changes is a trust-calibration problem rather than a diffing problem. The three-level workflow and the seven constructs we report give tool designers a conceptual framework for building AI-ready code review tools that surface risk and confidence signals at the granularity at which developers allocate attention.
2026-06-01T09:32:25Z
Submitted to ESEM SEIP 2026
Lo Gullstrand Heander, Agnia Sergeyuk, Ilya Zakharov, Emma Söderberg, Nikita Mukhortov

http://arxiv.org/abs/2604.13860v4
"AI Psychosis" in Context: How Conversation History Shapes LLM Responses to Delusional Beliefs
2026-06-01T06:44:11Z
Extended interaction with large language models (LLMs) has been linked to the reinforcement of delusional beliefs, attracting clinical and public concern. Yet most empirical work evaluates model safety in brief interactions, which may not reflect how harms develop through sustained dialogue. Five LLMs were tested across three levels of accumulated context, using the same escalating delusional conversation history to isolate its effect on model behaviour. Responses were coded on risk and safety dimensions, and each model was analysed qualitatively. Models separated into two distinct tiers: GPT-4o, Grok 4.1 Fast, and Gemini 3 Pro exhibited high-risk, low-safety profiles; Claude Opus 4.5 and GPT-5.2 Instant displayed the opposite pattern. As context accumulated, performance degraded in the unsafe group, while the same material activated stronger safety interventions among safer models. Qualitative analysis identified distinct mechanisms of failure, including validating the user's delusional premises, elaborating beyond them with new content, and attempting harm reduction from within the delusional frame. Safer models, however, often used the established relationship to support intervention, challenging delusional beliefs and directing the user to external support.
These findings indicate that accumulated context functions as a stress test of safety architecture, revealing whether prior dialogue is treated as a worldview to inherit or evidence to evaluate. Short-context assessments may therefore mischaracterise model safety, underestimating danger in some systems while missing context-activated gains in others. The results suggest that delusion reinforcement is a tractable alignment failure, with safer models establishing a baseline that future systems should now be expected to meet.
2026-04-15T13:27:23Z
Luke Nicholls, Robert Hutto, Zephrah Soto, Hamilton Morrin, Thomas Pollak, Raj Korpan, Cheryl Carmichael

http://arxiv.org/abs/2605.14830v2
Agentic AI and Human-in-the-Loop Interventions: Field Experimental Evidence from Alibaba's Customer Service Operations
2026-06-01T06:04:55Z
Agentic AI systems that autonomously perform service tasks are entering customer service operations. However, limited evidence exists on how human interventions shape service outcomes when agentic AI failures create both cognitive and emotional consequences. We study this issue through a randomized field experiment on Alibaba's Taobao platform. Workers in the treatment condition supervised an agentic AI system that resolved AI-eligible chats while continuing to handle AI-ineligible chats, whereas control workers resolved all chats without agentic AI. The findings show that AI deployment reduces average chat duration and has limited effects on retrial rates, but substantially lowers ratings for AI-eligible chats. Moreover, human intervention effectiveness in AI-eligible chats depends on the nature of AI failure, post-escalation intervention effort, and intervention timing. Human intervention preserves service quality in algorithm-triggered technical escalations, i.e., unresolved customer issues beyond the AI's capability, but is less effective in algorithm-triggered emotional escalations, i.e., where customers express frustration or dissatisfaction.
These differences are partly explained by variation in workers' post-escalation intervention effort across escalation types. In algorithm-triggered emotional escalations, workers showed lower engagement: they sent fewer messages, contributed a smaller share of total chat rounds, and showed less proactivity in information seeking and solution provision. We further find that early intervention is essential for sustaining high post-escalation intervention effort. Finally, we document a positive spillover effect on AI-ineligible chats, as treated workers adapted their multitasking workflow to devote greater attention to these chats. These findings offer implications for human-in-the-loop process design in human-AI collaboration systems.
2026-05-14T13:35:52Z
Yiwei Wang, Chuan Zhu, Tianjun Feng, Lauren Xiaoyuan Lu, Bingxin Jia

http://arxiv.org/abs/2604.01562v2
Acoustic and perceptual differences between standard and accented speech and their voice clones
2026-06-01T05:58:54Z
Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses showed larger original-clone distances for accented speakers in several speaker-discriminative embedding spaces, but this difference disappeared after normalizing against each speaker's within-original baseline variability. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech.
These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in baseline-normalized speaker-embedding distance, and they motivate treating accent preservation as an explicit component of speaker identity preservation, rather than assuming that it is fully captured by off-the-shelf speaker-discriminative embeddings.
2026-04-02T03:17:41Z
Tianle Yang, Chengzhe Sun, Phil Rose, Siwei Lyu

http://arxiv.org/abs/2511.06676v3
How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models
2026-06-01T05:41:41Z
Now that AI-driven moderation has become pervasive in everyday life, we often hear claims that "the AI is biased". While this is often said jokingly, the light-hearted remark reflects a deeper concern. How can we be certain that an online post flagged as "inappropriate" was not simply the victim of a biased algorithm? This paper investigates this problem using a dual approach. First, I conduct a quantitative benchmark of a widely used toxicity model (unitary/toxic-bert) to measure performance disparity between text in African-American English (AAE) and Standard American English (SAE). The benchmark reveals a clear, systematic bias: on average, the model scores AAE text as 1.8 times more toxic and 8.8 times higher for "identity hate". Second, I introduce an interactive pedagogical tool that makes these abstract biases tangible. The tool's core mechanic, a user-controlled "sensitivity threshold," demonstrates that the biased score itself is not the only harm; instead, the more-concerning harm is the human-set, seemingly neutral policy that ultimately operationalises discrimination. This work provides both statistical evidence of disparate impact and a public-facing tool designed to foster critical AI literacy.
2025-11-10T03:49:58Z
9 pages, 5 figures, 4 tables, 14 references.
Preliminary abstract presented at the International Conference on Envisioning the Himalayan Future: Pathways to Sustainability and Development (PUiCON 2026) p. 105; abstract available online at: https://pufoe.edu.np/wp-content/uploads/2026/05/PUiCON_2026_Book_of-_Abstracts.pdf
Subhojit Ghimire
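
The "sensitivity threshold" mechanic described in the last abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the scores are made-up toy values (not outputs of unitary/toxic-bert and not results from the paper), the second list is simply the first scaled by 1.8 to mimic the reported average ratio, and `flag_rate` is a hypothetical helper rather than part of the author's tool.

```python
def flag_rate(scores, threshold):
    """Return the fraction of posts whose toxicity score is at or above the threshold."""
    flagged = [s for s in scores if s >= threshold]
    return len(flagged) / len(scores)


# Hypothetical toy scores, NOT model outputs or results from the paper.
sae_scores = [0.10, 0.20, 0.30, 0.40, 0.50]
# Assumption: each score is 1.8x the SAE score for otherwise comparable text.
aae_scores = [0.18, 0.36, 0.54, 0.72, 0.90]

# The same "neutral" cutoff is applied to both groups at each setting.
for threshold in (0.3, 0.5, 0.7):
    sae = flag_rate(sae_scores, threshold)
    aae = flag_rate(aae_scores, threshold)
    print(f"threshold={threshold}: SAE flagged {sae:.0%}, AAE flagged {aae:.0%}")
```

Under these toy assumptions, a single cutoff that looks neutral flags a larger share of the shifted group at every setting, which is the policy-level harm the abstract points to.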