arXiv API feed (updated 2026-06-14T09:51:05Z)

http://arxiv.org/abs/2606.07594v1
Title: Syll: Open-Source Personal Automation with Cross-Surface Execution
Updated: 2026-05-28T17:59:31Z
Abstract: Personal AI agents must increasingly operate across APIs, shells, web surfaces, and desktop GUIs, yet many systems remain tuned to a single interface and offer limited support for user teaching and auditability. We present Syll, an open-source, self-hosted multimodal agent harness that unifies MCP/API tools, CLI execution, and visual GUI control in a modular runtime, enabling agents to coordinate computer use across heterogeneous interfaces while streamlining how users and agents exchange information. At the core of Syll is a bidirectional user-agent interaction layer: users teach procedures through direct demonstration, which Syll compiles into reusable skills; agent execution is translated back into multimodal evidence (logs, keyframes, and approval checkpoints) for inspection and control. Syll further externalizes memory, skills, routines, and governance as editable local artifacts, supporting straightforward inspection, extension, and downstream development. Our implementation has been validated on production desktop applications including Adobe Photoshop, Adobe Audition, Stardew Valley, macOS Finder, and others. We report mechanism-oriented studies that validate multimodal routing, teachable GUI replay, and persistent local artifacts.
We hope Syll can serve as a practical open-source foundation for personal automation that users can teach, inspect, and continuously extend.
Published: 2026-05-28T17:59:31Z
Comment: Code: https://github.com/THU-SAGE/syll
Authors: Bo Zhang, Borui Zhang, Chenghao Jiang, Minglei Shi, Xiaofeng Wang, Zheng Zhu, Jie Zhou, Jiwen Lu

http://arxiv.org/abs/2605.30273v1
Title: LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback
Updated: 2026-05-28T17:30:57Z
Abstract: Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data. At the same time, deploying proprietary, cloud-based models for mental health-related interactions raises important privacy and data-governance concerns, given the sensitivities. To address this challenge, we introduce LLUMI, a setup that can be hosted in-house within protected environments. LLUMI consists of two complementary components: a generation model (GM), which drafts supportive responses to mental health queries, and an improvement model (IM), which revises an initial human-crafted response. We leverage feedback signals from Reddit mental health communities, using community endorsement patterns such as upvotes and downvotes to construct chosen-rejected response pairs for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). We further align LLUMI using human evaluation across five dimensions: readability, empathy, connection, actionability, and safety. Our results show that, despite relying on smaller open-source models rather than proprietary cloud-based GPT models, LLUMI achieves comparable performance across linguistic analyses and human evaluations.
These findings suggest that open-source models, when trained with community-derived preference signals, can support high-quality mental health support assistance while offering a more privacy-preserving alternative for sensitive support contexts.
Published: 2026-05-28T17:30:57Z
Authors: Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar, Dong Whi Yoo, Eshwar Chandrasekharan, Koustuv Saha

http://arxiv.org/abs/2605.30256v1
Title: VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
Updated: 2026-05-28T17:20:01Z
Abstract: Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues.
As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.
Published: 2026-05-28T17:20:01Z
Comment: Project page: https://research.nvidia.com/labs/amri/projects/video-fdb/
Authors: Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello

http://arxiv.org/abs/2605.30152v1
Title: Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?
Updated: 2026-05-28T16:10:32Z
Abstract: Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating system already maintains in graph form. Rendering the structure as text and asking an LLM to recover it is a round-trip the system never had to take. We treat the always-on signal as graph updates rather than text and use a small temporal-graph-learning (TGL) model as the encoder: one forward pass yields a per-event trigger probability and a per-entity routing score, and only the downstream agent (turning a small structured handoff into a fluent user-facing sentence) is an LLM call, invoked only when the trigger fires. TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0); in trigger-architecture comparisons, one TGL checkpoint gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4-7x and 12-83x faster than every single-forward LLM-as-trigger configuration tested in each regime, with an approximately 220 MiB BF16 resident footprint deployable on-device alongside the privacy-sensitive activity stream it consumes.
Published: 2026-05-28T16:10:32Z
Comment: 31 pages, 5 figures, 7 tables
Authors: Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen, Siheng Xiong, Xiaoqian Wang, Jing Gao

http://arxiv.org/abs/2605.30127v1
Title: REACT: A Conditioning Framework for User-Adaptive sEMG Hand Pose Estimation
Updated: 2026-05-28T15:58:13Z
Abstract: Surface electromyography (sEMG) enables continuous hand pose estimation on wearable devices, but models trained on multi-user corpora degrade on unseen individuals due to inter-user variability in anatomy and electrode placement. We propose REACT, a lightweight conditioning framework that personalizes a frozen pretrained EMG-to-pose backbone at inference time using only a handful of calibration recordings. REACT learns a compact user embedding from calibration data and applies Feature-wise Linear Modulation (FiLM) to adapt the shared encoder's feature space, requiring no gradient updates at deployment. On the large-scale EMG2POSE benchmark, REACT improves over the state-of-the-art baseline across all three generalization splits in both regression and tracking modes, reducing angular error by up to 3.9% with minimal parameter overhead and under 45 seconds of per-user calibration.
Published: 2026-05-28T15:58:13Z
Comment: 6 pages, 3 figures
Authors: Eric Xie, Hei Shing Cheung

http://arxiv.org/abs/2605.29943v1
Title: A Domain-Informed Multi-Objective Framework for EEG Channel Selection in Motor Imagery BCIs
Updated: 2026-05-28T13:54:39Z
Abstract: Motor imagery (MI) classification using electroencephalography (EEG) signals is essential for advancing brain-computer interfaces (BCIs). Traditional EEG channel selection methods often face limitations, such as dependency on single-objective criteria and susceptibility to local optima. To address these challenges, this work proposes a multi-objective optimisation framework that employs a non-dominated sorting genetic algorithm, multi-objective particle swarm optimisation, and a multi-objective evolutionary algorithm based on decomposition.
Our approach effectively balances spatial relevance, using a Gaussian kernel, and functional discriminability, which assesses intratrial task-related desynchronisation, thereby improving performance. We evaluated this framework on four EEG datasets: Physionet, OpenBMI, HighGamma, and BCIIV-2A. The proposed approach successfully identifies compact, relevant channel subsets concentrated around sensorimotor cortex regions linked to MI activity, addressing the prevalent challenges of dimensionality and complexity inherent to traditional techniques. Furthermore, the framework achieved classification performance of 87%, 71%, 75%, and 65% on the Physionet, OpenBMI, HighGamma, and BCIIV-2A datasets, respectively. By outperforming existing single-objective and accuracy-based methods, and those relying on fixed subsets, these findings demonstrate that this new multi-objective optimisation framework can enhance MI-based BCI performance while facilitating compact channel configurations with reduced computational complexity, making them better suited for wearable, portable, and real-time BCI applications.
Published: 2026-05-28T13:54:39Z
Comment: This work has been submitted to the IEEE for possible publication
Authors: Dekka Muni Kumar, Dhruba Jyoti Kalita, Yogesh Kumar Meena

http://arxiv.org/abs/2604.27676v2
Title: Users' Activity Logs: the Good, the Bad, the Misconception, and the Disastrous
Updated: 2026-05-28T11:02:15Z
Abstract: Most service providers, such as Google, save logs from data generated by users while using the service. Many service providers provide users with privacy controls to manage whether, how, and for how long the data is saved and used by the service provider.
While most prior studies focused on the negative side of users' activity logs, such as users' lack of awareness about the logs' privacy controls and users' privacy concerns toward their data, this work aims to provide a balanced view of users' perceptions regarding activity logs by considering the positive, negative, and extremely negative (hence disastrous) sides, as well as the misconceptions of activity logs. In this work, we present a case study of Google's Activity controls by conducting a secondary analysis of interview data from 30 Google personal account holders in Saudi Arabia. Using template analysis, we analyzed the data through the lens of four main themes: the good, the bad, the misconception, and the disastrous aspects of users' activity logs from the users' perspective. Our findings uncover new themes and use cases, offering a balanced view of users' perceptions of activity logs, and provide a better understanding and a useful source for subsequent studies on related topics. We conclude with practical recommendations for service providers, privacy researchers and experts, and users alike.
Published: 2026-04-30T10:09:50Z
Comment: Published at the Information and Computer Security Journal (Emerald Publishing)
Authors: Eman Alashwali
DOI: 10.1108/ICS-12-2025-0541

http://arxiv.org/abs/2605.29677v1
Title: Embodied Virtual Reality Feedback Reshapes Neural Representations to Support Continuous Three-Dimensional Motor Imagery Decoding
Updated: 2026-05-28T09:39:11Z
Abstract: Continuous brain-computer interfaces (BCIs) that decode motion trajectories from imagined movement offer intuitive motor control, yet how feedback modality and longitudinal training shape neural representations and decoding performance remains poorly understood. We present the first systematic investigation of embodied virtual reality (VR) feedback during real-time 3D virtual limb control driven by motor imagery, across ten longitudinal sessions in ten participants.
Performance was evaluated using three strategies: actual online performance (Fixed Decoder Generalisation, FDG), periodic retraining (Sequential Adaptive Training, SAT), and within-session upper-bound estimation (Within-Session Reconstruction, WSR). A CNN-LSTM decoder achieved within-session imagined movement correlations of r = 0.762 under VR and r = 0.672 under screen feedback. VR significantly outperformed screen feedback across all strategies and movement dimensions (improvements of 8.9-13.0%, all p <= 0.002, d = 1.42-2.05). This advantage persisted under fixed decoders without retraining, demonstrating that embodied VR feedback elicits inherently more decodable and generalisable neural representations. Linear mixed-effects modelling confirmed robust main effects of feedback modality and movement axis with no interaction. Neurophysiologically, VR produced stronger sensorimotor-parietal desynchronisation and enhanced motor-frontal functional connectivity, with pervasive anterior insula engagement across all frequency bands and increased superior parietal lobule coupling, paralleling patterns associated with real movement execution. These findings establish embodied spatial feedback as a key design principle for next-generation continuous BCIs targeting intuitive motor control and neurorehabilitation.
Published: 2026-05-28T09:39:11Z
Comment: 28 pages, 7 figures, 3 tables. Submitted to Nature Biomedical Engineering. Data to be made available via Zenodo (DOI: 10.5281/zenodo.16047021)
Authors: Niall McShane, Attila Korik, Karl McCreadie, Naomi Du Bois, Darryl Charles, Damien Coyle

http://arxiv.org/abs/2310.01935v2
Title: Practitioners' Perspectives on Designing Data Visualizations for the General Public
Updated: 2026-05-28T09:38:44Z
Abstract: Public-facing data visualizations can play a vital role in making complex information clear and engaging, thereby encouraging informed public discourse and participation.
However, existing work offers limited insight into how practitioners make design decisions based on their envisioned target audiences and across different media channels. To investigate this, we conducted semi-structured interviews with 21 professionals from journalistic settings, focusing on how they conceptualize their readers, translate these notions into design choices, and evaluate their work. We found that practitioners often rely on broad audience definitions, despite considering "knowing their readers" essential. Evaluation primarily relies on peer feedback or social metrics rather than user testing. From these accounts, we identify recurring strategies employed to reach general, often undefined publics. We discuss implications for audience-centered authoring tools, proposing features such as persona simulations and content-adaptive multi-format authoring, message-first rhetoric-aware workflows, and lightweight in-tool evaluation to better support the realities of public-facing design.
Published: 2023-10-03T10:18:02Z
Comment: Accepted at CHI 2026
Journal reference: In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26). Association for Computing Machinery, New York, NY, USA, Article 1185, 1-19
Authors: Regina Schuster, Kathleen Gregory, Torsten Möller, Laura Koesten
DOI: 10.1145/3772318.3790627

http://arxiv.org/abs/2605.29675v1
Title: From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration
Updated: 2026-05-28T09:35:59Z
Abstract: Collaborations with Generative AI often begin with a short prompt and end with an opaque output, leaving implicit who was involved, what task was being pursued, which resources were used, and which constraints should have shaped the process. This limited contextual explicitness hinders trust, traceability, and accountability, particularly when Generative AI is embedded in information-intensive workflows such as search, querying, and profile management.
This paper introduces From Prompts to Context, an ontology-driven framework for representing Human-Generative AI collaboration. Its core component, the Contextual Collaboration AI Ontology (CCAI), models key elements of collaboration - including tasks, agent roles, resources, and constraints - as a shared machine-interpretable vocabulary. By combining populated CCAI instances with SPARQL-based context retrieval in operational workflows, the framework turns otherwise ephemeral prompt-response interactions into structured and queryable collaboration traces linking prompts, outputs, and their surrounding context. The approach is illustrated through a case study involving a software development team building a competency-based education feature for viewing and updating learner competency profiles. The case study shows how the framework can support the representation and documentation of collaboration episodes across requirements analysis, design, implementation, and testing. Within this setting, the results indicate that explicit collaboration modelling helps make task context more explicit, improves the traceability of AI-generated contributions, and supports more transparent and accountable Human-Generative AI practices. We conclude by outlining design principles for future Human-Generative AI systems that emphasise not only output quality, but also the explicit representation of the collaborative context in which outputs are produced.
Published: 2026-05-28T09:35:59Z
Authors: Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

http://arxiv.org/abs/2503.17971v2
Title: Generating Multimodal Textures with a Soft Hydro-Pneumatic Haptic Ring
Updated: 2026-05-28T09:21:20Z
Abstract: The growing adoption of extended reality (XR) has increased demand for wearable technologies that provide naturalistic tactile sensations while allowing users to interact freely with their environments using bare fingers. However, most existing wearable haptic devices support only a limited range of tactile modalities.
Here, we introduce a soft haptic ring and a data-driven rendering methodology for generating multimodal texture sensations. The device integrates pneumatic and hydraulic actuation to render roughness, thermal, and softness cues on the proximal phalanx. The ring can generate forces up to 1.75 N, produce displacements up to 0.27 mm within a 30-300 Hz operating range, and modulate display temperature by up to 25 °C within 65 s. The rendering methodology modulates these cues based on the user's exploratory actions: the hydraulic actuator conveys perceived temperature during static contact, while the pneumatic actuator generates pressure and vibration cues to convey softness and roughness during pressing and sliding gestures, respectively. We evaluated the system in a user study with 15 participants who matched six virtual textures generated by the ring to their real counterparts and rated their perceived sensations using guided exploratory actions. Participants achieved an average texture-matching precision of 68% and an F1 score of 0.68. Adjective ratings confirmed that the ring produces distinct and perceptually rich stimuli across all rendered modalities. These findings demonstrate the potential of the proposed haptic ring and rendering methodology to deliver multimodal tactile cues away from the fingertip for immersive XR applications, enabling diverse tactile feedback while preserving natural physical interaction.
Published: 2025-03-23T07:40:32Z
Comment: 29 pages, 19 figures, journal
Journal reference: International Journal of Human-Computer Studies. volume 212, pages 103814, 2026
Authors: Ana Sanz Cozcolluela, Koen Wosten, Yasemin Vardar
DOI: 10.1016/j.ijhcs.2026.103814

http://arxiv.org/abs/2304.10544v2
Title: What is the message? Perspectives on Visual Data Communication
Updated: 2026-05-28T08:42:02Z
Abstract: Data visualizations are widely used to communicate messages about urgent topics such as climate change and public health. However, we still know little about how these visualizations are produced and interpreted in popular science contexts.
In this mixed-method study, we examine how data are visually communicated and understood in the popular science magazine Scientific American, focusing on the messages these visualizations convey. To capture this complexity, we analyze data visualizations about climate change and pandemics in Scientific American over the past fifty years from three complementary perspectives: reader, chart, and producer. From the reader's perspective, we articulate takeaway messages and document sensemaking, interpreting visualizations first without and then with textual elements. From the chart perspective, we examine how visual features and text shape interpretation. From the producer's perspective, we draw on interviews with Scientific American staff to understand message planning and compare a sample of their intended messages with those we interpreted. Using takeaway messages as our central analytic lens, we develop a message typology and show that messages vary systematically across dimensions such as granularity, articulation, and inference. A key finding is that text plays a pivotal role: approximately two-thirds of messages change when textual elements are added. While the interviews highlighted the central role of message planning in visualization production, intended and interpreted messages only partially aligned. Our findings underscore the importance of contextual clarity and audience-aware communication, and we derive recommendations for visualization designers and science communicators.
Published: 2023-04-12T15:22:17Z
Authors: Regina Schuster, Kathleen Gregory, Christian Knoll, Torsten Möller, Laura Koesten

http://arxiv.org/abs/2605.29572v1
Title: Learning to Feel Materials from Multisensory Tactile Data via Interpretable Models
Updated: 2026-05-28T08:20:01Z
Abstract: Human tactile perception of materials relies on complex multisensory touch cues, yet the relationship between low-level tactile signals and perceptual representations remains poorly understood.
This knowledge gap hinders the integration of touch in digital environments and the development of robots capable of human-like tactile perception. Here, we present an interpretable computational framework for modeling human material perception and recognition using multisensory touch data. Our framework comprises three interconnected models: Model 1 maps finger-surface interaction features to psychophysical sensory attributes, Model 2 classifies materials based on these perceptual representations, and Model 3 directly classifies materials from tactile features. The results showed that combining information from pressing, static contact, and sliding interactions improves prediction accuracy, and that thermal cues are particularly informative for both perceptual modeling and material classification. These findings highlight the importance of thermal and compliance cues, which remain underrepresented in current robotic fingers and haptic displays. Incorporating such cues may enhance artificial systems' ability to approximate human material perception and guide the design of more perceptually grounded haptic interfaces.
Published: 2026-05-28T08:20:01Z
Comment: 12 pages, 3 figures, journal
Authors: Li Zou, Yasemin Vardar

http://arxiv.org/abs/2605.29543v1
Title: SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring
Updated: 2026-05-28T07:56:24Z
Abstract: Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications.
While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.
Published: 2026-05-28T07:56:24Z
Authors: Qihan Deng, Minghua Zhang, Yang Yang, Zhenyu Gao

http://arxiv.org/abs/2510.20743v2
Title: Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations
Updated: 2026-05-28T07:50:13Z
Abstract: We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users' emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational and smoothness alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules.
We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users' emotional signals are critical yet often opaque in verbal exchanges.
Published: 2025-10-23T17:08:03Z
Authors: Lorenzo Stacchio, Andrea Ubaldi, Alessandro Galdelli, Maurizio Mauri, Emanuele Frontoni, Andrea Gaggioli