http://arxiv.org/abs/2606.07594v1 Syll: Open-Source Personal Automation with Cross-Surface Execution 2026-05-28T17:59:31Z Personal AI agents must increasingly operate across APIs, shells, web surfaces, and desktop GUIs, yet many systems remain tuned to a single interface and offer limited support for user teaching and auditability. We present Syll, an open-source, self-hosted multimodal agent harness that unifies MCP/API tools, CLI execution, and visual GUI control in a modular runtime, enabling agents to coordinate computer use across heterogeneous interfaces while streamlining how users and agents exchange information. At the core of Syll is a bidirectional user-agent interaction layer: users teach procedures through direct demonstration, which Syll compiles into reusable skills; agent execution is translated back into multimodal evidence -- logs, keyframes, and approval checkpoints -- for inspection and control. Syll further externalizes memory, skills, routines, and governance as editable local artifacts, supporting straightforward inspection, extension, and downstream development. Our implementation has been validated on production desktop applications including Adobe Photoshop, Adobe Audition, Stardew Valley, macOS Finder and others. We report mechanism-oriented studies that validate multimodal routing, teachable GUI replay, and persistent local artifacts. We hope Syll can serve as a practical open-source foundation for personal automation that users can teach, inspect, and continuously extend. 
2026-05-28T17:59:31Z Code: https://github.com/THU-SAGE/syll Bo Zhang Borui Zhang Chenghao Jiang Minglei Shi Xiaofeng Wang Zheng Zhu Jie Zhou Jiwen Lu http://arxiv.org/abs/2605.30273v1 LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback 2026-05-28T17:30:57Z Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data. At the same time, deploying proprietary, cloud-based models for mental health-related interactions raises important privacy and data-governance concerns, given the sensitivity of such data. To address this challenge, we introduce LLUMI, a setup that can be hosted in-house within protected environments. LLUMI consists of two complementary components: a generation model (GM), which drafts supportive responses to mental health queries, and an improvement model (IM), which revises an initial human-crafted response. We leverage feedback signals from Reddit mental health communities, using community endorsement patterns such as upvotes and downvotes to construct chosen-rejected response pairs for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). We further align LLUMI using human evaluation across five dimensions: readability, empathy, connection, actionability, and safety. Our results show that, despite relying on smaller open-source models rather than proprietary cloud-based GPT models, LLUMI achieves comparable performance across linguistic analyses and human evaluations. These findings suggest that open-source models, when trained with community-derived preference signals, can support high-quality mental health support assistance while offering a more privacy-preserving alternative for sensitive support contexts. 
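The chosen-rejected pairing mentioned in the LLUMI abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Reply` fields, the `min_gap` threshold, and the output keys are all assumptions made for illustration, and the abstract does not say how endorsement gaps were thresholded.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Reply:
    post_id: str  # hypothetical identifier of the post being answered
    text: str
    score: int    # hypothetical net endorsement (upvotes minus downvotes)


def build_preference_pairs(replies, min_gap=5):
    """Pair replies to the same post; the higher-scored reply becomes 'chosen'."""
    # Group replies by the post they respond to.
    by_post = {}
    for reply in replies:
        by_post.setdefault(reply.post_id, []).append(reply)

    pairs = []
    for post_id, group in by_post.items():
        for a, b in combinations(group, 2):
            if abs(a.score - b.score) < min_gap:
                continue  # gap too small to treat as a clear preference
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append({
                "prompt_id": post_id,
                "chosen": chosen.text,
                "rejected": rejected.text,
            })
    return pairs
```

Requiring a minimum score gap is one possible way to avoid treating near-ties as preferences; a real pipeline would likely also need to control for reply age and thread visibility, which this sketch ignores.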
2026-05-28T17:30:57Z Jiwon Kim Maya Ajit Sherry Gong Soorya Ram Shimgekar Dong Whi Yoo Eshwar Chandrasekharan Koustuv Saha http://arxiv.org/abs/2605.30256v1 VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents 2026-05-28T17:20:01Z Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents. 
2026-05-28T17:20:01Z Project page: https://research.nvidia.com/labs/amri/projects/video-fdb/ Amrita Mazumdar Seonwook Park Rajarshi Roy Nikhil Srihari Shengze Wang Yuhao Zhou Julia Wang Koki Nagano Shalini De Mello http://arxiv.org/abs/2605.30152v1 Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor? 2026-05-28T16:10:32Z Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating system already maintains in graph form. Rendering the structure as text and asking an LLM to recover it is a round-trip the system never had to take. We treat the always-on signal as graph updates rather than text and use a small temporal-graph-learning (TGL) model as the encoder: one forward pass yields a per-event trigger probability and a per-entity routing score, and only the downstream agent (turning a small structured handoff into a fluent user-facing sentence) is an LLM call, invoked only when the trigger fires. TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0); in trigger-architecture comparisons, one TGL checkpoint gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4--7x and 12--83x faster than every single-forward LLM-as-trigger configuration tested in each regime, with an approximately 220 MiB BF16 resident footprint deployable on-device alongside the privacy-sensitive activity stream it consumes. 2026-05-28T16:10:32Z 31 pages, 5 figures, 7 tables Xiaoze Liu Ruowang Zhang Amir H. 
Abdi Michel Galley Zhikai Chen Siheng Xiong Xiaoqian Wang Jing Gao http://arxiv.org/abs/2605.30127v1 REACT: A Conditioning Framework for User-Adaptive sEMG Hand Pose Estimation 2026-05-28T15:58:13Z Surface electromyography (sEMG) enables continuous hand pose estimation on wearable devices, but models trained on multi-user corpora degrade on unseen individuals due to inter-user variability in anatomy and electrode placement. We propose REACT, a lightweight conditioning framework that personalizes a frozen pretrained EMG-to-pose backbone at inference time using only a handful of calibration recordings. REACT learns a compact user embedding from calibration data and applies Feature-wise Linear Modulation (FiLM) to adapt the shared encoder's feature space, requiring no gradient updates at deployment. On the large-scale EMG2POSE benchmark, REACT improves over the state-of-the-art baseline across all three generalization splits in both regression and tracking modes, reducing angular error by up to 3.9% with minimal parameter overhead and under 45 seconds of per-user calibration. 2026-05-28T15:58:13Z 6 pages, 3 figures Eric Xie Hei Shing Cheung http://arxiv.org/abs/2605.29943v1 A Domain-Informed Multi-Objective Framework for EEG Channel Selection in Motor Imagery BCIs 2026-05-28T13:54:39Z Motor imagery (MI) classification using electroencephalography (EEG) signals is essential for advancing brain-computer interfaces (BCIs). Traditional EEG channel selection methods often face limitations, such as dependency on single-objective criteria and susceptibility to local optima. To address these challenges, this work proposes a multi-objective optimisation framework that employs non-dominated sorting genetic algorithm, multiple-objective particle swarm optimisation, and a multi-objective evolutionary algorithm based on decomposition. 
Our approach effectively balances spatial relevance, using a Gaussian kernel, and functional discriminability, which assesses intratrial task-related desynchronisation, thereby improving performance. We evaluated this framework on four EEG datasets: Physionet, OpenBMI, HighGamma, and BCIIV-2A. The proposed approach successfully identifies compact, relevant channel subsets concentrated around sensorimotor cortex regions linked to MI activity, addressing the prevalent challenges of dimensionality and complexity inherent to traditional techniques. Furthermore, the framework achieved classification performance of 87%, 71%, 75%, and 65% on the Physionet, OpenBMI, HighGamma, and BCIIV-2A datasets, respectively. By outperforming existing single-objective and accuracy-based methods, and those relying on fixed subsets, these findings demonstrate that this new multi-objective optimisation framework can enhance MI-based BCI performance while facilitating compact channel configurations with reduced computational complexity, making them better suited for wearable, portable, and real-time BCI applications. 2026-05-28T13:54:39Z This work has been submitted to the IEEE for possible publication Dekka Muni Kumar Dhruba Jyoti Kalita Yogesh Kumar Meena http://arxiv.org/abs/2604.27676v2 Users' Activity Logs: the Good, the Bad, the Misconception, and the Disastrous 2026-05-28T11:02:15Z Most service providers, such as Google, save logs from data generated by users while using the service. Many service providers provide users with privacy controls to manage whether, how, and for how long the data is saved and used by the service provider. 
While most prior studies focused on the negative side of users' activity logs, such as users' lack of awareness about the logs' privacy controls and users' privacy concerns about their data, this work aims to provide a balanced view of users' perceptions regarding activity logs by considering the positive, negative, and extremely negative (hence disastrous) sides, as well as the misconceptions of activity logs. In this work, we present a case study of Google's Activity controls by conducting a secondary analysis of interview data from 30 Google personal account holders in Saudi Arabia. Using template analysis, we analyzed the data through the lens of four main themes: the good, the bad, the misconception, and the disastrous aspects of users' activity logs from the users' perspective. Our findings uncover new themes and use cases, offering a balanced view of users' perceptions of activity logs, and provide a better understanding and a useful source for subsequent studies on related topics. We conclude with practical recommendations for service providers, privacy researchers and experts, and users alike. 2026-04-30T10:09:50Z Published at the Information and Computer Security Journal (Emerald Publishing) Eman Alashwali 10.1108/ICS-12-2025-0541 http://arxiv.org/abs/2605.29677v1 Embodied Virtual Reality Feedback Reshapes Neural Representations to Support Continuous Three-Dimensional Motor Imagery Decoding 2026-05-28T09:39:11Z Continuous brain-computer interfaces (BCIs) that decode motion trajectories from imagined movement offer intuitive motor control, yet how feedback modality and longitudinal training shape neural representations and decoding performance remains poorly understood. We present the first systematic investigation of embodied virtual reality (VR) feedback during real-time 3D virtual limb control driven by motor imagery, across ten longitudinal sessions in ten participants. 
Performance was evaluated using three strategies: actual online performance (Fixed Decoder Generalisation, FDG), periodic retraining (Sequential Adaptive Training, SAT), and within-session upper-bound estimation (Within-Session Reconstruction, WSR). A CNN-LSTM decoder achieved within-session imagined movement correlations of r = 0.762 under VR and r = 0.672 under screen feedback. VR significantly outperformed screen feedback across all strategies and movement dimensions (improvements of 8.9-13.0%, all p <= 0.002, d = 1.42-2.05). This advantage persisted under fixed decoders without retraining, demonstrating that embodied VR feedback elicits inherently more decodable and generalisable neural representations. Linear mixed-effects modelling confirmed robust main effects of feedback modality and movement axis with no interaction. Neurophysiologically, VR produced stronger sensorimotor-parietal desynchronisation and enhanced motor-frontal functional connectivity, with pervasive anterior insula engagement across all frequency bands and increased superior parietal lobule coupling, paralleling patterns associated with real movement execution. These findings establish embodied spatial feedback as a key design principle for next-generation continuous BCIs targeting intuitive motor control and neurorehabilitation. 2026-05-28T09:39:11Z 28 pages, 7 figures, 3 tables. Submitted to Nature Biomedical Engineering. Data to be made available via Zenodo (DOI: 10.5281/zenodo.16047021) Niall McShane Attila Korik Karl McCreadie Naomi Du Bois Darryl Charles Damien Coyle http://arxiv.org/abs/2310.01935v2 Practitioners' Perspectives on Designing Data Visualizations for the General Public 2026-05-28T09:38:44Z Public-facing data visualizations can play a vital role in making complex information clear and engaging, thereby encouraging informed public discourse and participation. 
However, existing work offers limited insight into how practitioners make design decisions based on their envisioned target audiences and across different media channels. To investigate this, we conducted semi-structured interviews with 21 professionals from journalistic settings, focusing on how they conceptualize their readers, translate these notions into design choices, and evaluate their work. We found that practitioners often rely on broad audience definitions, despite considering "knowing their readers" essential. Evaluation primarily relies on peer feedback or social metrics rather than user testing. From these accounts, we identify recurring strategies employed to reach general, often undefined publics. We discuss implications for audience-centered authoring tools, proposing features such as persona simulations and content-adaptive multi-format authoring, message-first rhetoric-aware workflows, and lightweight in-tool evaluation to better support the realities of public-facing design. 2023-10-03T10:18:02Z Accepted at CHI 2026 In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26). Association for Computing Machinery, New York, NY, USA, Article 1185, 1-19 Regina Schuster Kathleen Gregory Torsten Möller Laura Koesten 10.1145/3772318.3790627 http://arxiv.org/abs/2605.29675v1 From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration 2026-05-28T09:35:59Z Collaborations with Generative AI often begin with a short prompt and end with an opaque output, leaving implicit who was involved, what task was being pursued, which resources were used, and which constraints should have shaped the process. This limited contextual explicitness hinders trust, traceability, and accountability, particularly when Generative AI is embedded in information-intensive workflows such as search, querying, and profile management. 
This paper introduces From Prompts to Context, an ontology-driven framework for representing Human-Generative AI collaboration. Its core component, the Contextual Collaboration AI Ontology (CCAI), models key elements of collaboration - including tasks, agent roles, resources, and constraints - as a shared machine-interpretable vocabulary. By combining populated CCAI instances with SPARQL-based context retrieval in operational workflows, the framework turns otherwise ephemeral prompt-response interactions into structured and queryable collaboration traces linking prompts, outputs, and their surrounding context. The approach is illustrated through a case study involving a software development team building a competency-based education feature for viewing and updating learner competency profiles. The case study shows how the framework can support the representation and documentation of collaboration episodes across requirements analysis, design, implementation, and testing. Within this setting, the results indicate that explicit collaboration modelling helps make task context more explicit, improves the traceability of AI-generated contributions, and supports more transparent and accountable Human-Generative AI practices. We conclude by outlining design principles for future Human-Generative AI systems that emphasise not only output quality, but also the explicit representation of the collaborative context in which outputs are produced. 2026-05-28T09:35:59Z Ngoc Luyen Le Marie-Hélène Abel Bertrand Laforge http://arxiv.org/abs/2503.17971v2 Generating Multimodal Textures with a Soft Hydro-Pneumatic Haptic Ring 2026-05-28T09:21:20Z The growing adoption of extended reality (XR) has increased demand for wearable technologies that provide naturalistic tactile sensations while allowing users to interact freely with their environments using bare fingers. However, most existing wearable haptic devices support only a limited range of tactile modalities. 
Here, we introduce a soft haptic ring and a data-driven rendering methodology for generating multimodal texture sensations. The device integrates pneumatic and hydraulic actuation to render roughness, thermal, and softness cues on the proximal phalanx. The ring can generate forces up to 1.75 N, produce displacements up to 0.27 mm within a 30-300 Hz operating range, and modulate display temperature by up to 25 °C within 65 s. The rendering methodology modulates these cues based on the user's exploratory actions: the hydraulic actuator conveys perceived temperature during static contact, while the pneumatic actuator generates pressure and vibration cues to convey softness and roughness during pressing and sliding gestures, respectively. We evaluated the system in a user study with 15 participants who matched six virtual textures generated by the ring to their real counterparts and rated their perceived sensations using guided exploratory actions. Participants achieved an average texture-matching precision of 68% and an F1 score of 0.68. Adjective ratings confirmed that the ring produces distinct and perceptually rich stimuli across all rendered modalities. These findings demonstrate the potential of the proposed haptic ring and rendering methodology to deliver multimodal tactile cues away from the fingertip for immersive XR applications, enabling diverse tactile feedback while preserving natural physical interaction. 2025-03-23T07:40:32Z 29 pages, 19 figures, journal International Journal of Human-Computer Studies. volume 212, pages 103814, 2026 Ana Sanz Cozcolluela Koen Wosten Yasemin Vardar 10.1016/j.ijhcs.2026.103814 http://arxiv.org/abs/2304.10544v2 What is the message? Perspectives on Visual Data Communication 2026-05-28T08:42:02Z Data visualizations are widely used to communicate messages about urgent topics such as climate change and public health. 
However, we still know little about how these visualizations are produced and interpreted in popular science contexts. In this mixed-method study, we examine how data are visually communicated and understood in the popular science magazine Scientific American, focusing on the messages these visualizations convey. To capture this complexity, we analyze data visualizations about climate change and pandemics in Scientific American over the past fifty years from three complementary perspectives: reader, chart, and producer. From the reader's perspective, we articulate takeaway messages and document sensemaking, interpreting visualizations first without and then with textual elements. From the chart perspective, we examine how visual features and text shape interpretation. From the producer's perspective, we draw on interviews with Scientific American staff to understand message planning and compare a sample of their intended messages with those we interpreted. Using takeaway messages as our central analytic lens, we develop a message typology and show that messages vary systematically across dimensions such as granularity, articulation, and inference. A key finding is that text plays a pivotal role: approximately two-thirds of messages change when textual elements are added. While the interviews highlighted the central role of message planning in visualization production, intended and interpreted messages only partially aligned. Our findings underscore the importance of contextual clarity and audience-aware communication, and we derive recommendations for visualization designers and science communicators. 
2023-04-12T15:22:17Z Regina Schuster Kathleen Gregory Christian Knoll Torsten Möller Laura Koesten http://arxiv.org/abs/2605.29572v1 Learning to Feel Materials from Multisensory Tactile Data via Interpretable Models 2026-05-28T08:20:01Z Human tactile perception of materials relies on complex multisensory touch cues, yet the relationship between low-level tactile signals and perceptual representations remains poorly understood. This knowledge gap hinders the integration of touch in digital environments and the development of robots capable of human-like tactile perception. Here, we present an interpretable computational framework for modeling human material perception and recognition using multisensory touch data. Our framework comprises three interconnected models: Model 1 maps finger-surface interaction features to psychophysical sensory attributes, Model 2 classifies materials based on these perceptual representations, and Model 3 directly classifies materials from tactile features. The results showed that combining information from pressing, static contact, and sliding interactions improves prediction accuracy, and that thermal cues are particularly informative for both perceptual modeling and material classification. These findings highlight the importance of thermal and compliance cues, which remain underrepresented in current robotic fingers and haptic displays. Incorporating such cues may enhance artificial systems' ability to approximate human material perception and guide the design of more perceptually grounded haptic interfaces. 2026-05-28T08:20:01Z 12 pages, 3 figures, journal Li Zou Yasemin Vardar http://arxiv.org/abs/2605.29543v1 SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring 2026-05-28T07:56:24Z Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. 
However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring. 2026-05-28T07:56:24Z Qihan Deng Minghua Zhang Yang Yang Zhenyu Gao http://arxiv.org/abs/2510.20743v2 Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations 2026-05-28T07:50:13Z We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. 
The system integrates a commercial facial expression recognition service to capture users' emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational and smoothness alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users' emotional signals are critical yet often opaque in verbal exchanges. 2025-10-23T17:08:03Z Lorenzo Stacchio Andrea Ubaldi Alessandro Galdelli Maurizio Mauri Emanuele Frontoni Andrea Gaggioli
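The idea in the Empathic Prompting abstract of embedding emotional cues as contextual signals during prompting could look roughly like the sketch below. The function name, the emotion-score dictionary, the `threshold` value, and the bracketed note format are all hypothetical; the abstract does not describe the actual prompt template or the recognition service's output schema.

```python
def build_empathic_prompt(user_text, emotion_scores, threshold=0.5):
    """Prepend a short affect note to the user's message when a cue is strong enough.

    emotion_scores is a hypothetical mapping from emotion label to confidence
    in [0, 1], standing in for the output of a facial expression recognizer.
    """
    if not emotion_scores:
        return user_text  # no non-verbal signal available

    # Use only the single strongest cue to keep the added context short.
    label, score = max(emotion_scores.items(), key=lambda item: item[1])
    if score < threshold:
        return user_text  # cue too weak to be worth passing to the model

    context = (
        f"[Non-verbal context: the user appears {label} "
        f"(confidence {score:.2f}).]"
    )
    return f"{context}\n{user_text}"
```

The user never types the affect note; it is added silently before the message reaches the model, which matches the abstract's point that no explicit user control is required.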