https://arxiv.org/api/uLgVgVJuV7C/kmKa/yW+xEvcbIw 2026-06-13T20:47:10Z 30934 150 15 http://arxiv.org/abs/2606.06702v1 Adversarial Co-Thinking: Calibration and Triangulation Across Multiple GenAI Tools in HCI Writing 2026-06-04T20:41:16Z

This paper examines what happens when GenAI tools are fully embedded in the drafting of an academic paper rather than confined to late-stage polishing. To investigate how an intensive multi-tool GenAI workflow differs from conventional academic writing, I drafted this paper from the first sentence in parallel with three GenAI tools - Claude, ChatGPT, and Gemini - comparing their outputs against my own intended contribution. Across this process, a recurring pattern took shape that I call adversarial co-thinking: using past peer reviews to calibrate the tools, then setting their outputs against one another to be tested rather than deferred to. I argue that surfacing genuine critique from tools that default to praise is a central practical challenge of working with these tools, and that the skill at stake is evaluative rather than generative. Adversarial co-thinking is a high-skill epistemic practice: it can amplify expertise where it exists, but it can also mask its absence. I further argue that current disclosure frameworks are poorly equipped to capture this shift. The paper offers four propositions for workshop discussion concerning autonomy, supervision, equity of access, and disclosure.

2026-06-04T20:41:16Z Pia Tukkinen http://arxiv.org/abs/2606.06650v1 LinkNav: Surfacing Interconnected Information in Scientific Articles 2026-06-04T18:58:26Z

We present LinkNav, an enhanced experience for reading academic papers which makes explicit connections between related but non-adjacent passages. To create the experience, we instruct a language model to generate questions that may arise while reading a passage and then search for answer passages elsewhere in the document, forming intra-document connections when answers are found. We confirm that these building blocks work well to power the experience, with an answer detection pipeline that works with high precision, resulting in a reasonable number of connections being made for a document. On a dataset of academic papers, we find that connected passages are on average ten segments away from each other, making explicit connections that a reader may have otherwise missed.

2026-06-04T18:58:26Z 10 pages, 3 figures, ACL 2026 (Demo Track) Sebastian Joseph Jennifer Healey Junyi Jessy Li Ani Nenkova http://arxiv.org/abs/2606.06614v1 Re-Centering Humans in LLM Personalization 2026-06-04T18:04:47Z

Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect human conversations (550 conversations) and judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.

2026-06-04T18:04:47Z Lechen Zhang Jiarui Liu Tal August http://arxiv.org/abs/2606.06429v1 Computational Modeling of Human Adaptation in Urban Infrastructure Management under Extreme Conditions: A Case Study of Subway Flood Scenarios 2026-06-04T17:34:19Z

Decision-making in urban infrastructure management during extreme events relies heavily on human operators, yet current computational support systems often fail to account for non-monotonic human adaptation and latent psychological biases like overconfidence and defensive overcorrection. This study addresses this gap by integrating Instance-Based Learning Theory (IBLT) into the domain of civil engineering computing. We establish a computational cognitive architecture that simulates operator decision processes through the mathematical mechanisms of memory retrieval and utility blending. This model functions as a computational baseline, representing boundedly rational adaptation driven by experiential priors, thus allowing for the algorithmic isolation of latent psychological biases from the baseline dynamics of memory-based learning. We demonstrated this framework using a human-in-the-loop microworld experiment simulating subway flood-induced track suspensions, where dispatchers must balance passenger safety against service efficiency. Analysis revealed a complex, non-linear human adaptation cycle consisting of four phases: acquisition, overconfidence, overcorrection, and recalibration. Specifically, the computational model exposed a significant divergence during the post-accident "overcorrection" phase: while human operators exhibited immediate, defensive risk overestimation, the model maintained a stable trajectory based on accumulated experience. This strategic divergence confirms that operational instability following failure is often attributable to acute psychological bias overriding stable memory-based adaptation, a pattern theoretically expected to recur across analogous high-stakes environments and validatable through multi-modal behavioral and sensor data from professional operators.
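The "memory retrieval and utility blending" mechanisms named above can be sketched as follows. This is a minimal illustration of the commonly cited Instance-Based Learning formulation (power-law activation decay, softmax retrieval probabilities, probability-weighted blending), not the authors' implementation: the abstract gives no equations, the function names are hypothetical, the decay and temperature values are chosen only for illustration, and the stochastic noise term is omitted so that the result is deterministic.

```python
import math

def activation(timestamps, now, decay=0.5):
    """Base-level activation: log of the summed power-law decayed recencies.

    Every timestamp must be strictly earlier than `now`.
    """
    return math.log(sum((now - t) ** -decay for t in timestamps))

def blended_value(instances, now, decay=0.5, temperature=0.35):
    """Blend stored outcomes, weighting each by its retrieval probability.

    instances: list of (outcome, timestamps) pairs for one option, where
    timestamps lists the times at which that outcome was observed.
    """
    acts = [activation(ts, now, decay) for _, ts in instances]
    peak = max(acts)  # subtract the maximum for numerical stability
    weights = [math.exp((a - peak) / temperature) for a in acts]
    total = sum(weights)
    return sum(w / total * outcome for w, (outcome, _) in zip(weights, instances))
```

Under these assumptions, two outcomes observed at the same time blend to their mean, while a more recent outcome pulls the blended value toward itself, which is the stable, experience-driven trajectory the abstract contrasts with human overcorrection.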

2026-06-04T17:34:19Z Jinfeng Lou Zijie Liang Pengkun Liu Yuxin Zhang Cleotilde Gonzalez Pingbo Tang http://arxiv.org/abs/2606.06565v1 AI Level of Detail: Distance-Aware ML Model Precision Selection for Real-Time Human Motion Prediction in Games 2026-06-04T17:27:00Z

Modern game engines spend significant compute animating NPCs with learned motion models. This paper proposes AI Level of Detail (AI LOD), a framework in which machine learning inference precision is adapted based on the distance between each NPC and the player camera. The core idea mirrors classical geometry LOD: substitute a cheaper approximation where the difference is imperceptible. Here, the approximation is a lower-precision quantized machine learning model rather than a lower-polygon mesh. The contribution of this work is the AI LOD concept itself: that inference-time quantization can serve as the LOD axis for AI-driven character animation - and more broadly, for any AI-based runtime system where perceptual sensitivity varies with context. The convolutional sequence-to-sequence model of Li et al. is used as a representative example to demonstrate the concept, with its trained checkpoint exported into three ONNX Runtime variants (FP32, FP16, and INT8 per-tensor), intended to be routed by a distance-based selector at runtime. Evaluation on the CMU Mocap dataset provides initial evidence that each precision tier can be served at its assigned distance range with negligible perceptible degradation, supporting the broader premise that distance-aware ML model precision selection is a viable LOD strategy for AI-based character animation.
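A distance-based selector of the kind described above might look like the sketch below. The tier names come from the abstract; the distance cut-offs, the constant names, and the function name are hypothetical, since the abstract does not state the assigned distance ranges. The sketch also assumes that the nearest NPCs receive the highest precision, by analogy with geometry LOD.

```python
from bisect import bisect_right

# Hypothetical distance cut-offs in world units; the abstract gives no values.
TIER_BOUNDARIES = [10.0, 30.0]
# Precision tiers ordered from nearest (most precise) to farthest (cheapest).
TIERS = ["FP32", "FP16", "INT8"]

def select_precision(distance):
    """Return the precision tier for an NPC at the given camera distance."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    # bisect_right counts how many boundaries lie at or below the distance,
    # which is exactly the index of the tier to use.
    return TIERS[bisect_right(TIER_BOUNDARIES, distance)]
```

In a runtime loop, the returned tier name would pick which of the three exported model variants serves that NPC for the current frame. Keeping the boundaries in a sorted list makes it easy to add or retune tiers without changing the lookup.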

2026-06-04T17:27:00Z Camera-ready for SIGGRAPH Technical Workshops 2026 Mathew Varghese http://arxiv.org/abs/2606.06297v1 A MATLAB Toolbox for Standardized Reading Speed Assessment: Implementing and Extending the Perrin Sentence Generator for English Corpora 2026-06-04T15:36:33Z

In the fields of vision science, cognitive psychology, and psycholinguistics, the accurate measurement of reading speed is frequently hampered by the limitations of static reading charts. Repeated testing often leads to memorization effects, while the requirement for oral recitation introduces speech-motor confounds that obscure true information processing speed. To address these methodological hurdles, this paper introduces an open-source MATLAB toolbox that adapts the sentence generation paradigm originally proposed by Perrin, Paillé, and Baccino (2014) for the English language. This system utilizes a semantic ontology and a "proto-truth" logic to autonomously generate thousands of unique, grammatically simple sentences with unambiguous truth values. Beyond the original scope of Maximum Reading Speed (MRS) measurement, this implementation introduces band-pass psycholinguistic filtering and specific logic to resolve semantic ambiguities unique to English. We present this complete software package as an open platform for the scientific community to validate and refine.

2026-06-04T15:36:33Z Daniel P. Spiegel Romain Bachy http://arxiv.org/abs/2606.06271v1 FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays 2026-06-04T15:13:25Z

While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal-orientation, anchoring to specific sentences, and prioritization. We introduce FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors on 69 twelfth-grade argumentative essays, paired with 1,644 comments generated from four frontier LLMs under a shared protocol, totaling 2,340 comments. We provide expert quality ratings on a subset of both instructor and LLM comments. We find that instructors and LLMs distribute feedback similarly across goals and essay positions, yet instructors and models diverge on the specific sentences on which to provide feedback. Additionally, we find that models tend to write more complex feedback and use fewer questions than instructors. LLM feedback also receives higher ratings on most dimensions of quality, as rated by instructors, but much of this advantage appears to be attributable to lengthier comments. FOXGLOVE enables systematic comparison of where human and LLM feedback align, diverge, and differ.

2026-06-04T15:13:25Z Yijun Liu Yifan Song John Gallagher Sarah Sterman Tal August http://arxiv.org/abs/2606.06560v1 MacArena: Benchmarking Computer Use Agents on an Online macOS Environment 2026-06-04T14:01:32Z

Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld, which serve both as evaluation tools and as training environments for reinforcement learning. However, macOS remains underserved in this landscape: the only existing benchmark, macOSWorld, covers a narrow slice of first-party applications with simpler tasks, and runs on x86 virtual machines incompatible with Apple Silicon. We introduce MacArena, a benchmark of 421 manually verified tasks spanning 50 applications that combines a curated port of OSWorld tasks, content sourced from macOSWorld, and 49 new macOS-native tasks, all running on Apple's native Virtualization framework on Apple Silicon. We argue that macOS presents distinct GUI challenges beyond what Linux-based benchmarks capture, and our evaluation supports this claim: strong model performance on existing benchmarks can reflect familiarity with task distributions rather than genuine cross-platform GUI competence. Notably, model rankings invert between ported and macOS-native tasks, with a leading model trailing by over 26% on the MacArena subset, suggesting that macOS poses a genuinely harder environment for current GUI agents.

2026-06-04T14:01:32Z Accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026 Victor Muryn Maksym Shamrai Sofiia Mazepa Yehor Khodysko http://arxiv.org/abs/2606.06177v1 Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios 2026-06-04T13:52:21Z

Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent -- only around half of interactions are rated as usable -- with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves -- and how well.

2026-06-04T13:52:21Z Code and data at https://github.com/g8a9/ouvia Giuseppe Attanasio Beatrice Savoldi Daniel Chechelnitsky Matteo Negri Marine Carpuat Maarten Sap André F. T. Martins http://arxiv.org/abs/2606.06081v1 A Framework for Measuring Appropriate Reliance on Set-Valued AI Advice 2026-06-04T12:17:42Z

Appropriate reliance on AI advice has become a central research theme in human-AI collaboration. Existing frameworks have focused exclusively on point predictions as AI advice. However, set-valued AI advice (e.g., discrete sets or continuous intervals) is increasingly being used to communicate uncertainty and improve human decision making. In this paper, we develop the first formal framework for measuring appropriate reliance on set-valued AI advice within the sequential judge-advisor paradigm, spanning both classification and regression tasks. For classification, we first introduce the dimensions that are necessary for evaluating set-valued AI advice. We then define two metrics: correct reliance rate on AI and correct reliance rate on self, which jointly characterize appropriate reliance in this setting. For regression, we introduce quantity of AI reliance and quality of AI reliance, which respectively measure whether a decision maker utilized the AI advice and whether their reliance helped them get closer to the ground truth relative to their initial estimate. Through the application of our framework, we demonstrate how these metrics capture important nuances in human-AI collaboration that existing measures overlook.

2026-06-04T12:17:42Z Ranjan Mishra Jakob Schoeffer http://arxiv.org/abs/2606.05995v1 Empathy on Demand: How Empathic AI Can Scale Emotional Support for Verbal Harassment 2026-06-04T10:42:55Z

Verbal harassment is a growing source of psychological stress for people around the world. It occurs both online and offline and relies on language to demean, threaten, or discredit its targets. Unlike other stressors such as loss or uncertainty, verbal harassment aims at silencing its targets by eroding their sense of being heard and weakening their perceived ability to respond. However, many individuals lack access to adequate and timely support when they experience such harassment. People increasingly turn to conversational artificial intelligence (AI) such as ChatGPT or dedicated AI companions for emotional support, raising questions about whether it can facilitate the same psychological benefits as actual human empathy. We focus on online contexts as a prevalent setting for verbal harassment. We develop and test a psychological framework identifying three key linguistic signals of empathic listening (perspective-taking, emotional validation, and action orientation) that together restore a sense of feeling heard and enhance coping in the context of verbal harassment. We find that LLMs consistently produce language exhibiting stronger empathic-listening markers than human non-experts and trained mental health professionals, promoting more approach-oriented (vs. avoidance-oriented) coping strategies. A subsequent behavioral study shows that these linguistic signals boost recipients' sense of feeling heard and increase their coping self-efficacy. These findings reveal how specific linguistic features create empathic connections between humans and advanced conversational AI and can enhance people's psychological resilience. Our results highlight the potential for AI to serve as a scalable source of emotional support, especially when human support is unavailable or insufficient.

2026-06-04T10:42:55Z Anouk Bergner Philipp Winder Christian Hildebrand http://arxiv.org/abs/2606.05855v1 EEGDancer: Dynamic Emotion Latent Space Masked Modeling with Reinforcement Learning for EEG Continuous Emotion Prediction 2026-06-04T08:28:31Z

Continuous electroencephalography (EEG) emotion prediction aims to model the temporal evolution of human emotional states from EEG signals. Unlike conventional discrete emotion recognition, continuous prediction requires capturing long-range temporal dependencies and coherent emotional dynamics. However, existing methods mainly rely on point-wise regression and directly model noisy high-dimensional EEG features, limiting their ability to characterize continuous emotional evolution. To address these challenges, we propose EEGDancer, a dynamic emotional latent space learning framework for continuous EEG emotion prediction. The framework integrates vector-quantized representation learning, masked temporal modeling, and reinforcement learning-based trajectory optimization into a unified architecture. Specifically, a causal spatiotemporal Vector-Quantization Variational Autoencoder (VQ-VAE) is designed to learn structured emotional prototypes and construct a discrete-continuous emotional latent space from EEG signals. Based on the learned latent representations, a Transformer-based masked dynamic modeling strategy captures long-range emotional dependencies and temporal evolution patterns. Furthermore, continuous emotion prediction is formulated as a sequential decision-making problem, and a Soft Actor-Critic (SAC) framework is introduced to optimize emotional prediction trajectories at the sequence level instead of frame-wise local fitting. Extensive experiments on the SEED, SEED-IV, and Long-Term Naturalistic Emotion datasets demonstrate that EEGDancer consistently outperforms existing machine learning and deep learning methods. Ablation studies further verify the effectiveness of the proposed latent space and reinforcement learning-based trajectory optimization for modeling continuous EEG emotional dynamics.
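The vector-quantization step at the heart of any VQ-VAE can be illustrated with a generic nearest-codebook lookup. The sketch below shows only that generic step, assuming squared Euclidean distance; it is not the causal spatiotemporal encoder described in the abstract, and the function name and toy codebook are hypothetical.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`.

    Closeness is measured by squared Euclidean distance; ties resolve to
    the lowest index.
    """
    def squared_distance(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))

    index = min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))
    return index, codebook[index]

# Toy codebook of three two-dimensional "prototypes" (illustrative values only).
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index, code = quantize([0.9, 1.2], toy_codebook)
```

Each continuous encoder output is replaced by its nearest prototype, so a sequence of EEG segments becomes a sequence of discrete indices that a masked sequence model can then be trained on.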

2026-06-04T08:28:31Z 51 pages, 9 figures, 13 tables Zhihao Zhou Weishan Ye Li Zhang Gan Huang Zhen Liang http://arxiv.org/abs/2606.05826v1 Architecting Strategic Influence: Operationalising the UXR Point of View Framework for Research Function Maturity 2026-06-04T08:05:42Z

This case study illustrates that the systematic application of the User Experience Research (UXR) Point of View (POV) framework serves as effective operational scaffolding for a UXR function undergoing the critical transition from incubation to maturity. By assimilating structured 'Offensive' and 'Defensive' strategies, the presented Playbook equips UXR leaders with an adaptable toolkit to systematically navigate common institutional barriers, such as stakeholder bias, reactive tasking, and insight fragmentation. Pre-emptive and purposeful application of growth strategies significantly increases the likelihood that the research function establishes itself as a strategic partner capable of delivering evidence-based, actionable perspectives. The analysis demonstrates how this deliberate, Playbook-driven maturity strategy empowers research functions to move beyond tactical execution and directly shape long-term business strategy.

2026-06-04T08:05:42Z Rohinin Singh Renee Barsoum http://arxiv.org/abs/2603.14805v2 Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development 2026-06-04T05:26:52Z

Enterprise software organizations accumulate critical institutional knowledge - architectural decisions, deployment procedures, compliance policies, incident playbooks - yet this knowledge remains trapped in formats designed for human interpretation. The bottleneck to effective agentic software development is not model capability but knowledge architecture. When any knowledge consumer - an autonomous AI agent, a newly onboarded engineer, or a senior developer - encounters an enterprise task without institutional context, the result is guesswork, correction cascades, and a disproportionate tax on senior engineers who must manually supply what others cannot infer. This paper introduces Knowledge Activation, a framework that specializes AI Skills - the open standard for agent-consumable knowledge - into structured, governance-aware Atomic Knowledge Units (AKUs) for institutional knowledge delivery. Rather than retrieving documents for interpretation, AKUs deliver action-ready specifications encoding what to do, which tools to use, what constraints to respect, and where to go next - so that agents act correctly and engineers receive institutionally grounded guidance without reconstructing organizational context from scratch. AKUs form a composable knowledge graph that agents traverse at runtime - compressing onboarding, reducing cross-team friction, and eliminating correction cascades. The paper formalizes the resource constraints that make this architecture necessary, specifies the AKU schema and deployment architecture, and grounds long-term maintenance in knowledge commons practice. A Yahoo deployment surveying 67 engineers shows statistically significant developer-experience gains - 2.6 hours per week saved, Net Promoter Score +35. Organizations that architect their institutional knowledge for the agentic era will outperform those that invest solely in model capability.

2026-03-16T04:10:35Z Preprint. 59 pages, 11 figures. v2 is a major revision: adds an enterprise case study (a Yahoo deployment evaluated by an anonymous 67-engineer survey), with findings integrated into the abstract, introduction, discussion, and conclusion; methodology tightened and references expanded Gal Bakal http://arxiv.org/abs/2606.05667v1 Sustainability by Design in Decentralized Autonomous Organizations: An Empirical Review of Governance, Innovation, and Institutional Design 2026-06-04T03:49:28Z

Recent innovation theories in economics remain largely grounded in assumptions of hierarchical firms and closed organizational boundaries, offering limited insight into how innovation unfolds within decentralized, digitally native organizations. Decentralized Autonomous Organizations (DAOs) represent an emerging form of innovation ecosystem characterized by blockchain-based transparency, open participation, and token-driven governance, in which sustainability can be embedded directly into organizational design. This study compares two standards, ERC-8004 and Google A2A, which address the same agent-interoperability question but differ in governance: the former is governed by a DAO and the latter by a corporate consortium. They are examined through an LLM-powered comparative pipeline for large-scale governance discourse analysis, integrating automated annotation, neural topic modeling, and multi-layer network analysis to study socio-technical power structures. The study provides evidence-based insights for scholars, policymakers, and designers seeking to align innovation, technological governance, and sustainability in future organizational forms.

2026-06-04T03:49:28Z Yutian Wang Luyao Zhang