https://arxiv.org/api/xD063qLhWdUHtb4MZZlrZtYeWl8 2026-06-13T12:28:22Z 30934 45 15 http://arxiv.org/abs/2508.20464v3 Human-Centered Design for Connected Automation: Predicting Pedestrian Crossing Intentions 2026-06-10T03:31:25Z More than half of the 1.19 million annual traffic fatalities globally involve vulnerable road users, such as pedestrians, with a significant proportion attributable to human error. Level-5 automated driving systems (ADSs) have the potential to reduce these incidents; however, their effectiveness depends not only on automation performance but also on their ability to communicate intent and coordinate safely with pedestrians in the absence of traditional driver cues. This study aims to model pedestrian decision-making in road-crossing scenarios involving level-5 ADSs by extending the Theory of Planned Behavior (TPB) with safety, trust, compatibility, and understanding. An online survey (n = 212) found that perceived behavioral control, attitude, and social information significantly influence pedestrians' crossing intentions, with perceived safety and understanding having the strongest effects on the TPB constructs. The results offer guidance for designing eHMIs and cooperative V2X communication strategies that promote safe pedestrian-ADS interactions and advance human-centered design for autonomous vehicles. 2025-08-28T06:31:03Z Sanaz Motamedi Viktoria Marcus Griffin Pitts http://arxiv.org/abs/2606.11613v1 Factions Within, Uncertain Across: Within-Document Reader Sub-Groups in Social Highlighting 2026-06-10T03:25:36Z When many people highlight the same document, is the crowd a single consensus, or is it internally structured into reader sub-groups that mark different things -- and is that structure a stable property of a reader or of the document?
Building on prior work showing an individual's within-document highlighting signal is a whisper while individuality lives in selection, we ask the group-level question on a co-readership platform using a margin-preserving curveball null. Experiment 1: within a document, readers form strong sub-groups -- pairs agree far beyond what shared salience, mark density, and sentence popularity predict (nearest-neighbour agreement z=+6.3, significant in 88% of documents). Under an eight-block region-preserving null, shared engagement with the same coarse regions of the document accounts for about 40% of this excess; the majority survives as finer reader-specific agreement (z=+3.6, 77% significant). So the within-document crowd is, in a descriptive sense, factional. Experiment 2: is that grouping a stable reader trait? Here we are honest about power. The cross-document split-half reproducibility of a pair's agreement is near zero pooled (+0.078 and 0.000 in two separately drawn samples), and a power calibration shows the test is informative only for pairs that co-read many documents. In the only informative high-overlap subset (k>=4), point estimates are positive but small-sample, imprecise across the separately drawn samples, never significant, and attenuate under the region-preserving null. We therefore leave cross-document stability unresolved: the data is consistent with anything from situational grouping to a weak-to-moderate stable reader trait. The crowd is factional within a document; whether its factions follow the reader across documents is, honestly, beyond our reach. 
2026-06-10T03:25:36Z 11 pages, 3 figures, 3 tables Kazuki Nakayashiki Keisuke Watanabe http://arxiv.org/abs/2606.10120v2 MetaPlate: Counterfactual-Guided RAG-LLM Tool for Personalized Food Recommendation and Hyperglycemia Prevention 2026-06-10T02:51:13Z Postprandial hyperglycemia is a key risk factor for metabolic disorders; however, existing dietary guidance is often static, impractical, and insufficiently personalized, providing recommendations that are difficult to follow or not impactful. While recent advances leverage continuous glucose monitoring (CGM) and machine learning to predict glycemic responses, these approaches are largely predictive and lack actionable guidance. Moreover, recommendation systems are often misaligned with user goals and require extensive input. We present MetaPlate, a counterfactual explanation (CF) guided, context-aware decision-support framework that generates personalized meal recommendations to mitigate postprandial glucose excursions in healthy adults. MetaPlate integrates multimodal data, including CGM readings, wearable-derived physiological signals, and user-provided meal inputs from $25$ individuals, to model pre-meal context. A machine learning model predicts glucose response, while a CF optimization module adjusts meal composition by modifying macronutrient amounts to maintain glucose levels within a target range ($\leq 140$ mg/dL). An LLM-based retrieval-augmented generation (RAG) layer enhances interpretability by producing human-readable recommendations using constrained search of the USDA food database. We evaluate MetaPlate via a structured expert-in-the-loop assessment with registered dietitians (RDs), comparing performance before and after prompt refinement. Results show improvements in meal realism, portion suitability, and recommendation likelihood, with expert feedback indicating a shift from clinically implausible outputs to actionable, contextually appropriate recommendations.
Our findings emphasize the importance of domain knowledge and structured constraints in LLM-driven systems and highlight the potential of MetaPlate as a real-time personalized dietary decision-support tool. 2026-06-08T19:52:08Z Asiful Arefeen Carol Johnston Hassan Ghasemzadeh http://arxiv.org/abs/2510.02660v2 When Researchers Say Mental Model/Theory of Mind of AI, What Are They Really Talking About? 2026-06-09T21:44:21Z When researchers claim AI systems possess ToM or mental models, they are fundamentally discussing behavioral predictions and bias corrections rather than genuine mental states. This position paper argues that the current discourse conflates sophisticated pattern matching with authentic cognition, missing a crucial distinction between simulation and experience. While recent studies show LLMs achieving human-level performance on ToM laboratory tasks, these results are based only on behavioral mimicry. More importantly, the entire testing paradigm may be flawed in applying individual human cognitive tests to AI systems rather than assessing human cognition directly in the moment of human-AI interaction. I suggest shifting focus toward mutual ToM frameworks that acknowledge the simultaneous contributions of human cognition and AI algorithms, emphasizing the interaction dynamics, instead of testing AI in isolation. 2025-10-03T01:37:32Z This work has been accepted at CogInterp @ NeurIPS 2025 Xiaoyun Yin Elmira Zahmat Doost Shiwen Zhou Garima Arya Yadav Jamie C. Gorman http://arxiv.org/abs/2601.18934v2 Whispering Water: Materializing Human-AI Dialogue as Interactive Ripples 2026-06-09T21:13:51Z Water has long served as a recipient of human confession across cultures. We present \textit{Whispering Water}, an interactive installation that materializes human-AI dialogue through cymatic patterns on water. Participants confess to a water surface, triggering a four-phase ritual: confession, contemplation, response, and release.
Speech sentiment is translated into excitation frequencies that prime the water's physical state, while semantic content enters a multi-agent system of heterogeneous LLMs whose identities emerge through situated discourse. A novel algorithm decomposes synthesized speech into harmonic components via logarithmic spacing and Bark-scale mapping, reconstructing machine voices as physical wave superpositions. The installation explores emotional self-exploration through sensory-rich, ritually framed human-AI interaction. 2026-01-26T20:12:53Z Ruipeng Wang Tawab Safi Yunge Wen Christina Cunningham Hoi Ling Tang Behnaz Farahi http://arxiv.org/abs/2606.11349v1 Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents 2026-06-09T18:25:57Z In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger, we propose ACTION-RATING, a formulation that places it inside the agent's action space on a shared ordinal scale with navigation, so that asking competes directly with acting at every decision point and help-seeking becomes observable at intermediate states. Two structurally distinct information-seeking modes emerge from the agent's own ratings: mandatory (no viable branch) and opportunistic (residual uncertainty despite a leading candidate). On Harmonized Tariff Schedule classification (30,000-node taxonomy, three benchmarks, 9~LLMs across 4 families), we observe a regime shift from mandatory to opportunistic clarification, with Information-Seeking Effectiveness (ISE), a local diagnostic defined as the fraction of help interactions followed by a correct next navigation step (not a final-task metric), rising from 50% to 74%. Three diagnostic contrasts fail to reproduce this structure. 
A separability test shows that the information-seeking pattern (mode split, ISE ranking) persists when answer quality is degraded (-18.8% accuracy), supporting an empirical separation between where an agent seeks help and the quality of the help it receives. Under the controlled answer channel, accuracy gains reach +16.2% at 10-digit; we read this as an upper bound on what better localization could unlock, not a deployment estimate. 2026-06-09T18:25:57Z Aijing Gao Yiming Kang Mengdie Flora Wang Jae Oh Woo http://arxiv.org/abs/2606.11336v1 Towards a Joint Understanding of Remote Operation for Vehicles in Public Road Traffic 2026-06-09T18:14:52Z Sustained driving automation systems are envisioned to be used as the foundation for driverless mobility services. However, both researchers and practitioners acknowledge that current driving automation systems are not yet able to handle all traffic situations that a human driver can handle. To bridge this gap and enable mobility services without an in-vehicle human driver or fallback, remote operation (or teleoperation) is increasingly discussed. Recently, first legal actions have been taken to enable some forms of remote operation on public roads. Remote operation encompasses a broad spectrum of methods to support a driving automation system, ranging from remote assistance, which includes providing information or releasing a maneuver, to remote driving, which includes driving the vehicle from a remote location. As such, safe implementation of remote operation in public road traffic challenges the collaboration of multiple academic disciplines (e.g. engineering, psychology, informatics, law, etc.) and stakeholders (e.g. remote operation service providers, remote operators, vehicle manufacturers, regulatory authorities, etc.). At the same time, the interdisciplinary discourse is often challenging due to differing expectations and language. 
To build common ground, this article traces terminology back to the original differences in information processing on both the human and the vehicle side. This framework aims to help further discourse by directly specifying what is needed to engage a diverse audience including researchers and stakeholders of different backgrounds and interests. Recently discussed forms of teleoperation are integrated into this framework. 2026-06-09T18:14:52Z Elisabeth Shi Maria-Magdalena Wolf Nina Theobald Bettina Abendroth Eugen Wige Johannes Springer Katharina Hottelart Andreas Schrank Thorben Brandt Michael Oehl Frank Diermeyer Lena Plum http://arxiv.org/abs/2606.11176v1 Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories 2026-06-09T17:51:55Z Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music.
We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at https://data2story.github.io. 2026-06-09T17:51:55Z Project page: https://data2story.github.io Github: https://github.com/QinghongLin/data2story-skill Kevin Qinghong Lin Batu EI Yuhong Shi Pan Lu Philip Torr James Zou http://arxiv.org/abs/2606.11116v1 Designed by Journalists, but Is It for Readers? Rethinking AI Disclosures and Transparency in News 2026-06-09T17:13:40Z As newsrooms integrate generative AI, journalists face a disclosure challenge: how to communicate AI involvement in ways that maintain reader trust. Current practice offers two approaches: brief one-line labels or detailed disclosures specifying human oversight, editorial accountability, and error reporting mechanisms. Neither achieves journalists' goal of building trust through transparency. An existing controlled experiment with 34 news readers shows that detailed disclosures trigger a \textit{transparency dilemma}, reducing trust rather than increasing it, and risk introducing dark patterns that readers scroll past with the illusion of transparency.
One-line disclosures avoid this effect but can create an information gap, prompting readers to expend cognitive effort searching for signs of AI involvement that the disclosure indicates but does not explain. Yet readers are not rejecting transparency; they proposed disclosure designs centered on user agency: detail-on-demand interactions, proportional AI-ratio visualizations, outlet-level signals, and explicit "no AI" labels. I argue that this disconnect between what practitioners believe is responsible disclosure and what users actually need is a design problem for the HCI community. 2026-06-09T17:13:40Z Accepted to CHIWORK Workshop (Interrogating GenAI Augmentation for CHIworkers: Strategies for Professional Autonomy and Accountability) Pooja Prajod http://arxiv.org/abs/2509.19152v2 A Scoping Review of Mixed Initiative Visual Analytics in the Automation Renaissance 2026-06-09T16:53:06Z Artificial agents are increasingly integrated into data analysis workflows, carrying out tasks that were primarily done by humans. Our research explores how the introduction of automation recalibrates the dynamic between humans and automating technology. To explore this question, we conducted a scoping review encompassing twenty years of mixed-initiative visual analytic systems. To describe and contrast the relationship between humans and automation, we developed an integrated taxonomy to delineate the objectives of these mixed-initiative visual analytics tools, how much automation they support, and the assumed roles of humans. Here, we describe our qualitative approach of integrating existing theoretical frameworks with new codes we developed. Our analysis shows that the visualization research literature lacks consensus on the definition of mixed-initiative systems and explores a limited potential of the collaborative interaction landscape between people and automation.
Our research provides a scaffold to advance the discussion of human-AI collaboration during visual data analysis. Our integrated taxonomy is available in the form of a web application on https://smonadjemi.github.io/miva. 2025-09-23T15:30:34Z Shayan Monadjemi Yuhan Guo Kai Xu Alex Endert Anamaria Crisan 10.1111/cgf.70434 http://arxiv.org/abs/2606.11051v1 Making Software Meaningful 2026-06-09T16:16:11Z Adopting a single measure can improve the usability, modularity and accountability of software: a commitment to explicit meaning. This entails constructing and agreeing upon a representation of the behavior of the software, as observed in the domain of application. The phenomena comprising this behavior become a vocabulary that grounds all discourse about the software, among all stakeholders, and for all artifacts and activities. These phenomena are individuals; actions they participate in; and facts that result from actions. They can be organized, by partitioning the set of actions, into concepts, offering larger units of meaning. Examples of exploiting meaning are given in three areas: designing for usability (by aligning user and designer on a single shared meaning); generating modular code with LLMs (by mapping units of meaning to units of code, achieving not only modularity but also legibility); and making agents accountable (by having them adhere to a code of conduct that defines their intended behavior). 2026-06-09T16:16:11Z Eagon Meng Abutalib Namazov Carmel Schare Alcino Cunha Daniel Jackson http://arxiv.org/abs/2606.11004v1 A Case Study Reexamining the Cold-Start Problem in Knowledge Tracing Models and Implications for SafeInsights, an Education Research Infrastructure 2026-06-09T15:39:46Z Knowledge tracing (KT) models are widely used to predict students' evolving knowledge states from their learning history. 
However, many KT models are evaluated using specific datasets, platforms, and learning contexts, raising questions about whether reported model performance replicates and generalizes across newer datasets that vary in context. This paper replicates and extends Zhang et al. (2021), which examined the cold-start problem in KT models and found that deep-learning-based KT models performed better, partly because of stronger predictions when students began practicing a skill. Using a more recent ASSISTments dataset, FoundationalASSIST, we replicate the previous analysis by evaluating model performance across opportunities to practice and extend the analysis by examining performance across problem types, including fill-in-the-blank, multiple-choice select-one, multiple-choice select-all, and order/sort problems. Results show that KT model performance varies across both student practice trajectories and problem types. Beyond the empirical replication, this study identifies practical challenges in reproducing educational data mining studies and serves as a proof of concept, showing how privacy-preserving research infrastructures such as SafeInsights can be leveraged to facilitate educational research and support replication analyses. 2026-06-09T15:39:46Z Jiayi Zhang Ryan S. Baker Debshila Basu Mallick Cristina Heffernan Neil Heffernan http://arxiv.org/abs/2601.16700v2 Adoption of Generative Artificial Intelligence in the German Software Engineering Industry: An Empirical Study 2026-06-09T15:39:26Z Generative artificial intelligence (GenAI) tools have seen rapid adoption among software developers. While adoption rates in the industry are rising, the underlying factors influencing the effective use of these tools, including the depth of interaction, organizational constraints, and experience-related considerations, have not been thoroughly investigated. 
This issue is particularly relevant in environments with stringent regulatory requirements, such as Germany, where practitioners must address the GDPR and the EU AI Act while balancing productivity gains with intellectual property considerations. Despite the significant impact of GenAI on software engineering, to the best of our knowledge, no empirical study has systematically examined the adoption dynamics of GenAI tools within the German context. To address this gap, we present a comprehensive mixed-methods study on GenAI adoption among German software engineers. Specifically, we conducted 18 exploratory interviews with practitioners, followed by a developer survey with 109 participants. We analyze patterns of tool adoption, prompting strategies, and organizational factors that influence effectiveness. Our results indicate that experience level moderates the perceived benefits of GenAI tools, and productivity gains are not evenly distributed among developers. Further, organizational size affects both tool selection and the intensity of tool use. Limited awareness of the project context is identified as the most significant barrier. We summarize a set of actionable implications for developers, organizations, and tool vendors seeking to advance artificial intelligence (AI) assisted software development. 2026-01-23T12:42:33Z Accepted at FSE '26 Ludwig Felder Tobias Eisenreich Mahsa Fischer Stefan Wagner Chunyang Chen 10.1145/3803437.3805207 http://arxiv.org/abs/2606.09570v2 UXBench: Benchmarking User Experience in AI Assistants 2026-06-09T13:55:26Z As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. 
The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse failure patterns that pose severe challenges. Extensive experiments on 26 frontier language models provide novel insights into how well models perceive user experience and how improvements in model capability contribute to better dialogue engagement. Through comprehensive analysis of model behavior and performance gaps, we show that user feedback prediction is a learnable capability, where a reward model trained from in-the-wild feedback signals can achieve well-calibrated accuracy. We further document the systematic biases of LLM-as-a-judge evaluation protocols and compare typical response strategies that directly affect user experience. UXBench establishes a new evaluation landscape and calls for greater attention to tailored UX optimization, contributing to a user-centric scaling law that shapes the success of AI assistants. 2026-06-08T14:44:01Z Mengze Hong Xia Zeng Zeyang Lei Sheng Wang Chen Jason Zhang Di Jiang Taiming Fu Jinfeng Huang Mengqiao Liu Qinghe Chang Haosheng Zou Qiongyi Zhou Sijun He Simonjmdeng Haojing Huang Zijian Li Lucas Mu Li Fubao Zhang Mona Zhou Wei Ma Chenxuan Ma Yuanmeng Zhang Jian Song Minlong Peng Di Liang Davey Chen http://arxiv.org/abs/2606.10861v1 From Perception to Action: Can UI Interventions Foster Sustainable LLM Chatbot 2026-06-09T13:39:21Z LLM-powered chatbots are increasingly embedded in everyday workflows, raising sustainability concerns due to their energy use. Most mitigation strategies emphasize model or infrastructure efficiency, while the user-interface (UI) layer remains underexplored despite its potential to shape interaction behavior. 
We investigate whether sustainability-oriented UI interventions can increase users' energy awareness and encourage more energy-responsible chatbot use without reducing usability. We first conducted a baseline survey with 77 participants to assess awareness and receptiveness to intervention concepts. Guided by prior work on persuasive technology and choice architecture, we implemented a web-based chatbot prototype with a three-mode switch (Energy-efficient, Balanced, Performance), per-response energy feedback, pre-send energy estimates, a usage metrics dashboard, and energy analogies. We then evaluated the prototype in a five-day field study with 11 participants. In the baseline survey, 94.8% of respondents reported at least some awareness of AI energy use, yet 88.3% misestimated actual consumption. Although concern about environmental impact was high, only 39.0% indicated willingness to accept a performance trade-off for lower energy use. In the field study, Energy-efficient mode accounted for 55.8% of logged prompts, while 90.9% self-reported actively choosing Eco-mode when high accuracy was not required. Participants did not reduce prompt length, suggesting mode switching as the primary behavioral mechanism. Sustainability-oriented UI interventions can improve awareness and support more energy-responsible interaction patterns in LLM chatbots. These effects are best interpreted as behavioral and model-based estimates that complement backend efficiency work, and the provided prototype and replication package support further research on energy-aware conversational AI design. 2026-06-09T13:39:21Z Nitish Patkar Pooja Rani Jack Glässer Simon Lüscher Martin Kropp