http://arxiv.org/abs/2601.22599v2 A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation 2026-06-02T03:17:48Z

Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe co-occurrence of events. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence of events by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that, compared with the state-of-the-art model SAM-Audio, which was trained on a dataset roughly 500 times larger than Hive, certain open-source models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibit remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing purity of supervised signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs. Code and dataset are available at https://cslikai.cn/Hive.

2026-01-30T05:43:57Z Accepted to ICML 2026 Kai Li Jintao Cheng Chang Zeng Zijun Yan Helin Wang Zixiong Su Bo Zheng Xiaolin Hu http://arxiv.org/abs/2606.03020v1 Hanger Reflex Based Driving Assistance for Drivers with Peripheral Visual Field Defects 2026-06-02T01:50:35Z

Drivers with peripheral visual field defects may fail to notice pedestrians in their peripheral visual field, leading to delayed hazard awareness and increased collision risk. This study explores hanger reflex cue (HRC) as a driving assistance method for drivers with peripheral visual field defects, in which mechanical pressure is applied to specific regions of the head to facilitate anticipatory orientation toward potentially risky pedestrians and support safer driving. In a driving simulator experiment with 15 participants, we compared driving behavior with and without HRC during pedestrian encounters under simulated peripheral visual field defect. The results showed that HRC significantly shifted drivers' modal head rotation angle toward the risky pedestrian and significantly increased gaze duration toward that pedestrian. Collision occurrence was lower in the w/ HRC condition than in the w/o HRC condition, although the direct effect of HRC on collision occurrence showed only a marginal trend. A piecewise structural equation modeling analysis further suggested that HRC may contribute to collision reduction through a sequential pathway from head rotation to gaze allocation and then to collision occurrence. These findings provide preliminary evidence that HRC can support anticipatory attention allocation toward peripheral hazards and may offer a promising driving assistance method for drivers with visual field impairment.

2026-06-02T01:50:35Z Hailong Liu Junya Wada Toshihiro Hiraoka Junpei Kuwana Makoto Itoh Takahiro Wada http://arxiv.org/abs/2606.02996v1 MARIO: Motion-Augmented Real-Time Multi-Sensor Inertial Odometry 2026-06-02T01:12:28Z

Inertial odometry (IO) using only Inertial Measurement Units (IMUs) provides a lightweight solution for human motion tracking in augmented reality (AR) and wearable devices. Recent learning-based IO methods have improved the generalizability of inertial localization through large-scale pretraining on human motion datasets. However, these approaches remain prone to drift and noise because they do not explicitly capture human motion dynamics, especially on daily activity datasets such as Nymeria. In this work, we propose to ground inertial odometry in human kinematics through a learned IMU-inferred pose prior, which promotes physically consistent motion constraints. We integrate this pose prior into existing IO architectures and reduce positional drift by up to 36% on the challenging Nymeria dataset, which is 5x larger than datasets used in prior work. We further improve long-term performance with a sensor-fusion framework that incorporates auxiliary signals from lightweight sensors already available on commercial AR glasses, including magnetometers, barometers, and secondary IMUs. With this fusion strategy, positional drift is reduced by up to 42%, improving robustness and generalization across diverse motion conditions. Together, our results introduce a new paradigm for inertial and lightweight odometry by unifying human motion kinematics with multimodal sensing, setting a new benchmark for accurate and robust camera-less human tracking. Our website is available at https://spice-lab.org/projects/MARIO/.

2026-06-02T01:12:28Z CVPR 2026 Findings Yiquan Li Taeyoung Yeon Chenfeng Gao Vasco Xu Xuanyou Liu Karan Ahuja http://arxiv.org/abs/2606.02977v1 A Benchmarking Framework for Multimodal User Interface Toolkits: Comparing Modality Coverage, Developer Workflow, and Experimental Support 2026-06-02T00:31:38Z

Multimodal user interfaces increasingly combine speech, gesture, vision, gaze, touch, biosignals, and other sensor data. Recent toolkits from the past five years, such as Geno, Multisensor-Pipeline (MSP), ReactGenie, and EmoSync, aim to make it easier for developers to prototype such interfaces, while older work such as WAMI shows how early web-based multimodal systems were conceived. Yet the field still lacks a systematic and reusable way to compare what these toolkits actually support, how much implementation work they offload from developers, and which evaluation strategies are appropriate for them. This paper reframes an HCI seminar draft into a benchmarking framework paper for multimodal user interface toolkits. Rather than reporting completed empirical results, it proposes a structured benchmark based on document analysis, technical comparison, and a future developer-based evaluation. The framework is organized around three dimensions: modality coverage and interaction abstraction, developer experience and workflow, and experimental and integration support. The paper illustrates the framework through five representative toolkits: Geno, MSP, ReactGenie, WAMI, and EmoSync. The contribution is a reusable benchmark template that future researchers can instantiate with empirical measurements, developer studies, and additional multimodal toolkits.

2026-06-02T00:31:38Z 13 pages, 3 tables, 1 figure. Benchmarking framework paper revised and expanded from an HCI seminar draft Ariton Verush http://arxiv.org/abs/2606.02974v1 WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition 2026-06-02T00:25:46Z

Human Activity Recognition (HAR) using WiFi signals has emerged as a transformative technology for smart homes, healthcare monitoring, security systems, and ambient assisted living. Unlike traditional camera-based systems that raise significant privacy concerns and fail in low-light conditions, or wearable sensors that require user compliance, WiFi-based HAR is non-intrusive, privacy-preserving, cost-effective, and works seamlessly in any lighting condition. This paper presents a comprehensive approach to recognize three distinct human activities: "No Presence" (empty room), "Walking", and "Walking + Arm-waving" using the Wallhack1.8k WiFi spectrogram dataset. We propose three key improvements to address the main challenges in WiFi-based HAR. First, to address high performance variance, we implement ensemble learning with five different CNN architectures (Deep CNN, Wide CNN, MobileNetV2, ResNet50V2, and EfficientNetB0). Second, to address the small dataset size limitation, we apply aggressive data augmentation techniques including time-warping, frequency masking, and noise addition. Third, to evaluate real-world generalization capability, we perform cross-scenario evaluation (training on Line-of-Sight and testing on Non-Line-of-Sight) and cross-antenna evaluation (training on Biquad antenna and testing on PIFA antenna). Our ensemble model achieved a test accuracy of 94.87% on the LOS scenario with Biquad antenna, outperforming the best individual model by 0.66%. Data augmentation improved Random Forest performance from 60% to 95%. Cross-scenario evaluation showed minimal accuracy drops of only 1.37% and 2.07%, demonstrating strong generalization capabilities. The results indicate that the proposed approach is robust, reliable, and suitable for real-world deployment in diverse environments with different hardware configurations.
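
As an illustration of two of the augmentations named in the abstract above (frequency masking and noise addition), here is a minimal sketch in plain Python. It assumes a spectrogram stored as a list of frequency rows, a SpecAugment-style mask that zeroes one random band of rows, and additive Gaussian noise. The function names, the `max_width` and `sigma` parameters, and the toy input are hypothetical and are not taken from the paper; time-warping is omitted.

```python
import random


def frequency_mask(spec, max_width, rng):
    """Return a copy of spec with one random band of frequency rows zeroed."""
    n_freq = len(spec)
    width = rng.randint(0, min(max_width, n_freq))  # band height, may be 0
    start = rng.randint(0, n_freq - width)          # first masked row
    return [
        [0.0] * len(row) if start <= i < start + width else list(row)
        for i, row in enumerate(spec)
    ]


def add_gaussian_noise(spec, sigma, rng):
    """Return a copy of spec with zero-mean Gaussian noise added to every cell."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in spec]


rng = random.Random(0)                  # fixed seed for repeatability
spec = [[1.0] * 8 for _ in range(6)]    # toy 6-bin x 8-frame spectrogram
masked = frequency_mask(spec, max_width=3, rng=rng)
noisy = add_gaussian_noise(masked, sigma=0.01, rng=rng)
```

Both helpers return new lists rather than modifying their input, so the same source spectrogram can be augmented several times with different random draws.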

2026-06-02T00:25:46Z 8 pages, 5 figures Maheen Arshad Qindeel E Zahra Muhammad Khuram Shahzad http://arxiv.org/abs/2606.02970v1 From Explanation to Diagnosis: Next Generation Interactive Video Coach with Misstep Awareness 2026-06-02T00:15:51Z

Intelligent tutoring systems excel at generating explanations but rarely provide principled diagnosis of where and why a learner is wrong. We introduce a misstep-aware coaching capability for Ivy, a neurosymbolic AI coach, built on a two-model architecture that augments a Task-Method-Knowledge (TMK) model with a new Pedagogical Model (PM) in the context of an online graduate AI course at Georgia Tech. The PM makes instructor diagnostic knowledge explicit and machine-readable by encoding, for each quiz question and incorrect response, the learner's underlying belief (a brief statement of the incorrect idea or missing knowledge), a TMK locus (the source of the misunderstanding), a misconception type, and targeted scaffolding derived from the instructor's Q&A key. Using quiz questions from the course, we demonstrate a proof-of-concept pipeline that detects and classifies learner errors and generates diagnosis-grounded scaffolding. This moves Ivy beyond knowledge retrieval toward diagnostic misstep awareness and enables more precise, actionable feedback that supports conceptual change and advances adaptive learning systems in AI in education and the learning sciences.

2026-06-02T00:15:51Z Xiao Jin Rahul K. Dass Ashok K. Goel http://arxiv.org/abs/2606.02962v1 Hand Trajectory Fusion for Egocentric Natural Language Query Grounding 2026-06-01T23:46:18Z

Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand-object manipulation or its immediate outcome. We propose a hand-trajectory encoder that converts a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video-text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.
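
The R1@IoU=0.3 figures above can be read through the following minimal sketch. It assumes the usual definition of the metric: the fraction of queries whose top-ranked predicted interval overlaps the ground-truth interval with a temporal IoU of at least 0.3. The function names and the toy intervals (in seconds) are hypothetical and are not taken from the paper or from the Ego4D evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time intervals."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.3):
    """Fraction of queries whose top-1 interval reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts)
    )
    return hits / len(gts)


# Two toy queries: the first overlaps its ground truth, the second misses it.
preds = [(0.0, 10.0), (0.0, 10.0)]
truth = [(5.0, 15.0), (20.0, 30.0)]
score = recall_at_1(preds, truth)
```

In this toy case the first pair has an IoU of 5/15 and counts as a hit, the second has an IoU of 0, so `score` is 0.5.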

2026-06-01T23:46:18Z Accepted for the poster session at the Egocentric Vision (EgoVis) Workshop in Conjunction with CVPR 2026 Enmin Zhong Carlos R. del-Blanco Fernando Jaureguizar Narciso García http://arxiv.org/abs/2606.02951v1 SCOPE: Real-Time Natural Language Camera Agent at the Edge 2026-06-01T23:07:44Z

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.

2026-06-01T23:07:44Z 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16-19, 2026. Code: https://github.com/HindsboNikolaj/SCOPE Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026 Nikolaj Hindsbo Sina Ehsani Pragyana Mishra 10.1145/3757279.3785641 http://arxiv.org/abs/2606.02933v1 Characterization and Effects of CS2 Learning with GenAI, Visualization, and Human Support 2026-06-01T22:31:44Z

Generative AI (GenAI) is becoming a widely adopted learning support tool for both students and instructors, as it offers benefits such as personalized tutoring and scaffolded learning. However, recent research highlights potential drawbacks such as overreliance and metacognitive issues, especially in novice programmers. Most prior work focuses on introductory programming courses, and important questions remain about the mechanisms behind the negative effects of GenAI and whether findings generalize when students learn more advanced computer science concepts. To address this gap, we conducted a mixed-methods study comparing student interactions with GenAI to two traditional learning supports in a second-year algorithms course: algorithm visualization (AV) and human live tutoring (LT). Twelve students participated in three 90-minute study sessions focusing on sorting, tree, and graph algorithms. We recorded gaze and interaction data, and each session concluded with a test assessing their conceptual understanding of the topic. Our analysis classifies when during the problem-solving process participants sought help, and compares the interaction patterns across the three learning supports. Although GenAI produced a larger increase in self-efficacy than live tutoring, it was associated with noticeably lower learning outcomes. We found that participants did not use algorithm visualizations effectively, faced usage barriers when using GenAI to learn advanced topics, and that live tutoring yielded the highest learning outcomes.

2026-06-01T22:31:44Z Accepted at the ACM Conference on International Computing Education Research (ICER 2026) Quinton Yong Anthony Estey Miguel Nacenta 10.1145/3765964.3811663 http://arxiv.org/abs/2606.02883v1 LLM-Assisted Reranking to Operationalize Nuanced Objectives in Recommender Systems 2026-06-01T20:54:14Z

Recommender systems have grown from content-organization tools into sophisticated systems that shape daily behavior. By controlling what we see, they shape what we perceive, raising concerns about filter bubbles, radicalization, polarization, and social inequality. Large language models (LLMs) enable more powerful personalization, intensifying these dynamics. Yet most recommenders are tuned for engagement or limited accuracy metrics, with little attention to broader social implications, e.g. how personalization reshapes exposure in socially consequential domains. We investigate whether LLM-assisted reranking, while improving personalization, inadvertently amplifies exposure to ideologically extreme or conspiratorial political content, a risk theorized but not empirically characterized in news recommendation. Using real news-consumption histories, we rerank YouTube's sidebar candidates through zero-shot, instruction-based prompting. We compare a baseline prompt with a constrained variant that preserves topical relevance and broadens ideological exposure while reducing conspiratorial or extreme content. Without constraints, reranking strengthened personalization but increased exposure to conspiratorial and extremist material for users whose histories contained such content. Lightweight prompt-level regularization reduced promotion of extreme content and increased ideological diversity, with modest relevance loss. Synthetic experiments suggest that LLMs rerank via statistical regularities in language rather than semantic understanding of ideology, clarifying why naive prompts amplify these patterns and why regularization can reshape them. Together, our results highlight the power of LLMs to operationalize contextual nuance in high-stakes recommendation, and the need to evaluate LLM-assisted personalization beyond accuracy and treat prompt design as a value-laden rather than neutral default.

2026-06-01T20:54:14Z 30 pages total; 11 pages, 5 figures, 2 tables (main text); 19 pages, 11 figures, 9 tables (appendix) Amir Ghasemian Homa Hosseinmardi Upasana Dutta Duncan J. Watts http://arxiv.org/abs/2606.02839v1 Human Factors in Cybersecurity in Icelandic Small and Medium-sized Enterprises 2026-06-01T20:02:29Z

Cybersecurity threats are increasing in all aspects of society due to the integration of digital systems into modern-day life and a volatile geo-political landscape. Technical factors are an ongoing arms race; however, the threat surface from human and social factors is still present, often providing malicious actors the means to bypass complex technical security controls. Understanding human factors in light of technical evolution is essential to ensure security controls remain effective. This study presents the results of a survey on cybersecurity challenges within public and private sector organisations, including critical infrastructure providers, in Iceland (N = 130). From the management perspective, human factors were strongly noted as challenges and barriers to their organisations' security. These challenges include a lack of adequate training or awareness, hiring issues, poor cybersecurity culture, and time and/or financial resource constraints. Based on these findings, recommendations for mitigating threats from human factors are derived. These include: prioritising targeted over generic training to reduce employee fatigue, external government support for financially constrained organisations, and building a strong cybersecurity culture through constructive communication around shared responsibilities.

2026-06-01T20:02:29Z To be published in 17th EAI International Conference on Digital Forensics & Cyber Crime, 8 - 10 September 2026, Reykjavík, Iceland Goda Cicėnaitė Thomas Welsh Helmut Neukirchen http://arxiv.org/abs/2510.21011v3 Generating the Modal Worker: A Cross-Model Audit of Race and Gender in LLM-Generated Personas Across 41 Occupations 2026-06-01T18:35:39Z

As generative AI tools are increasingly used to portray people in professional roles, understanding their racial and gender representational biases is critical. We audit over 1.5 million occupational personas generated by four major large language models (GPT-4, Gemini 2.5, DeepSeek V3.1, and Mistral-medium) across 41 U.S. occupations. Comparing these personas against U.S. Bureau of Labor Statistics (BLS) data, we find that models generate demographics with less variation than real-world data, functionally compressing each occupation toward a dominant demographic profile rather than representing population-level variation. A shift/exaggeration decomposition reveals the structure of these distortions: White (-31 percentage points) and Black (-9 pp) workers are consistently underrepresented, while Hispanic (+17 pp) and Asian (+12 pp) workers are overrepresented, with stereotype exaggeration amplifying existing occupational segregation. These distortions are often extreme, including near-total portrayals of housekeepers as Hispanic and the near-erasure of Black workers from many occupations. Because these patterns recur across models with different institutional and cultural origins, they suggest shared structural sources of bias rather than model-specific artifacts. We argue that auditing generative AI requires evaluation frameworks that examine how synthetic populations systematically reshape demographic visibility across social roles.

2025-10-23T21:43:08Z Ilona van der Linden Sahana Kumar Arnav Dixit Aadi Sudan Smruthi Danda David C. Anastasiu Kai Lukoff http://arxiv.org/abs/2606.02763v1 InquiryBits: Sharing AI Conversation Traces to Support Collaboration Within Trust Boundaries 2026-06-01T18:27:16Z

AI chat tools are shifting problem-solving and brainstorming conversations away from colleagues and into private AI interactions, reducing the shared awareness that supports team coordination. We introduce InquiryBits, a system that shares minimal summaries of AI conversations within configurable trust boundaries, separating AI-only analysis from human-visible sharing. In a study with 80 professionals, we find that people are broadly willing to share these traces to support collaboration and avoid duplicating work, but only within bounded groups. Comfort drops sharply as the audience expands beyond close teams; the level of detail shared matters less than who can see it, with a preference for more detail over less within trusted groups. These findings suggest that trust boundaries, more than information granularity, may be the most impactful design parameter.

2026-06-01T18:27:16Z 7 pages, 3 figures Caitlin Morris Pattie Maes http://arxiv.org/abs/2606.02425v1 Fostering Emotional Perspective-Taking: An Exploration of Affective Face-Tracking Interactions in the VR Narrative Rekindle 2026-06-01T16:03:30Z

Interest in leveraging emotions in Interactive Digital Narrative (IDN) has been growing, and Virtual Reality (VR) offers rich access to real-time biometric data such as facial expressions; yet this capability remains underexplored in novel IDN design. Existing approaches typically treat emotion input superficially, such as adjusting system difficulty or aesthetics, but rarely influence how players experience the narrative itself. Prior work also lacks a focus on a specific authored narrative. We propose an experimental affective interaction model that uses a VR headset's built-in face-tracking capability to recognize player emotional states, fostering "emotional perspective-taking" between the player and their embodied story character, thereby deepening the player's emotional connection to the character and their narrative engagement with the VR experience.

2026-06-01T16:03:30Z 5 pages, 5 figures. Interactivity paper accepted to DIS Companion '26 (Designing Interactive Systems Conference), Singapore, June 2026 Hector Fan Casper Hartveld Mark Sivak 10.1145/3802974.3808681 http://arxiv.org/abs/2606.02382v1 Attention Dynamics and Adaptive Decision Support in C5ISR: A Recurrence Quantification Analysis of Visual and Multimodal Attention Guidance Effects on Mission Performance 2026-06-01T15:31:46Z

Modern command, control, communications, computers, cyber, intelligence, surveillance, and reconnaissance (C5ISR) environments place substantial attentional demands on mission commanders. Failures in attention allocation in these high-risk settings can have severe operational consequences. This study investigates the efficacy of gaze-driven, attention-guided adaptive decision support tools, including visual-only and multimodal designs, in a high-fidelity simulated military command center. To characterize gaze and attentional dynamics during interaction with these tools, recurrence quantification analysis was applied to eye-tracking data. Stepwise regression using the Bayesian information criterion was then used to identify recurrence-based gaze metrics associated with performance. Results showed that the multimodal adaptive decision support tool was associated with significantly higher performance than the visual-only attention-guided tool. Average diagonal line length showed a negative linear association with performance, whereas entropy showed a positive linear association. Recurrence rate, determinism, and entropy also showed nonlinear quadratic relationships with performance. In particular, recurrence rate and determinism followed an inverted-U pattern consistent with the Yerkes-Dodson law. These findings suggest that effective performance in dynamic C5ISR contexts depends on a balance between structured and flexible visual scanning, and that recurrence-based gaze metrics can help characterize attentional dynamics during interaction with adaptive decision support systems.
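
The recurrence measures named above (recurrence rate, determinism, average diagonal line length) can be sketched as follows. This is a simplified illustration and not the study's pipeline: it assumes gaze is given as a sequence of area-of-interest labels and treats two fixations as recurrent when their labels match, whereas gaze RQA is often defined with a distance radius between fixation positions. All function names, the `min_len` parameter, and the toy sequence are hypothetical.

```python
def recurrence_points(seq):
    """Upper-triangle pairs (i, j), i < j, whose fixations share a label."""
    n = len(seq)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if seq[i] == seq[j]}


def recurrence_rate(seq):
    """Share of all fixation pairs that are recurrent."""
    n = len(seq)
    pairs = n * (n - 1) // 2
    return len(recurrence_points(seq)) / pairs if pairs else 0.0


def diagonal_line_lengths(seq):
    """Lengths of maximal diagonal runs of recurrent points."""
    pts = recurrence_points(seq)
    lengths = []
    for i, j in pts:
        if (i - 1, j - 1) in pts:
            continue  # this point continues a line that started earlier
        length = 1
        while (i + length, j + length) in pts:
            length += 1
        lengths.append(length)
    return lengths


def determinism(seq, min_len=2):
    """Share of recurrent points lying on diagonal lines of length >= min_len."""
    lengths = diagonal_line_lengths(seq)
    total = sum(lengths)
    return sum(l for l in lengths if l >= min_len) / total if total else 0.0


def average_diagonal_length(seq, min_len=2):
    """Mean length of diagonal lines of length >= min_len (0.0 if none)."""
    lines = [l for l in diagonal_line_lengths(seq) if l >= min_len]
    return sum(lines) / len(lines) if lines else 0.0


scan = ["A", "B", "C", "A", "B", "C"]  # a scan path that repeats once
rr = recurrence_rate(scan)
det = determinism(scan)
avg_len = average_diagonal_length(scan)
```

For the toy scan path, 3 of the 15 fixation pairs are recurrent and all three lie on a single diagonal of length 3, so `rr` is 0.2, `det` is 1.0, and `avg_len` is 3.0. A fully repeated scan like this sits at the highly structured end of the range that the abstract relates to performance.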

2026-06-01T15:31:46Z 11 Figures, 3 Tables Hyun-Gee Jei Caleb J. Armstrong Farzan Sasangohar