https://arxiv.org/api/XG/ZjuXjzhrPzqdYkmcTedC0L0U
2026-06-14T08:37:05Z
3093431515

http://arxiv.org/abs/2605.31026v1
From Statistics to Individuals: An Exploration of Zoomable Empathic Visualizations
2026-05-29T08:57:31Z
Data visualization is a powerful tool for conveying statistical information, but when representing populations, it tends to hide individuals. We introduce Zoomable Empathic Visualizations (ZEVs), interactive experiences allowing users to smoothly navigate between abstract statistical visualizations and more qualitative, relatable representations focused on individuals. We present three use cases of ZEVs and report on a qualitative user study that highlights opportunities for deeper understanding and emotional engagement, while pointing to areas for improvement and further refinement. In summary, ZEVs point toward new approaches for revealing the individuals behind the data.
2026-05-29T08:57:31Z
Edwige Chauvergne (ILDA), Arnaud Prouzeau (ILDA), Martin Hachet (BIVWAC, Inria), Pierre Dragicevic (BIVWAC)

http://arxiv.org/abs/2602.01959v2
Boosting metacognition in entangled human-AI interaction to navigate cognitive-behavioral drift
2026-05-29T08:49:09Z
People navigate complex environments using cues, heuristics, and other strategies, which are often adaptive in stable settings. However, as AI increasingly permeates society's information environments, those become more adaptive and evolving: LLM-based chatbots participate in extended interaction, maintain conversational histories, mirror social cues, and can hypercustomize responses, thereby shaping not only what information is accessed but how questions are framed, how evidence is interpreted, and when action feels warranted.
Here we propose a framework for sustained human-AI interaction that rests on invariant features of human cognition and human--AI interaction and centers on three interlinked phenomena: entanglement between users and AI systems, the emergence of cognitive and behavioral drift over repeated interactions, and the role of metacognition in the awareness and regulation of these dynamics. As conversational agents provide cues (e.g., fluency, coherence, responsiveness) that people treat as informative, subjective confidence and action readiness may increase without corresponding gains in epistemic reliability, making drift difficult to detect and correct. We describe these dynamics across micro-, meso-, and macro-levels. The framework identifies four metacognitive intervention points and psychologically informed interventions that provide metacognitive scaffolding (boosting and self-nudging). Finally, we outline a long-horizon research agenda for scientific foresight.
2026-02-02T11:04:45Z
Ezequiel Lopez-Lopez, Christoph M. Abels, Philipp Lorenz-Spreen, Stephan Lewandowsky, Stefan M. Herzog

http://arxiv.org/abs/2605.30930v1
TUX: Measuring Human--AI Tacit Understanding
2026-05-29T07:19:58Z
As large language models (LLMs) increasingly act as collaborative partners, human--AI alignment is often evaluated through explicit task success, accuracy, or reward optimization. Yet many collaborative settings depend on tacit understanding: whether an agent can align with a human's evaluative stance or representational priors without clear objectives, communication, or feedback. To study this capacity, we develop a spectrum-placement task inspired by the social party game Wavelength, in which humans and agents independently place concepts along subjective spectra. We operationalize the Tacit Understanding Index (TUX) as a pairwise measure of similarity between human and agent judgments, and evaluate it with 241 human participants and 200 profile-conditioned LLM agents across four models.
We find that nearest human--agent pairs in trait space achieve significantly higher TUX, suggesting that tacit alignment is structured by person-level characteristics rather than random similarity. Regression analyses show that TUX becomes more explainable as predictor sets become richer, with individual traits, decision-making styles, and confidence improving over aggregate trait-distance baselines. These findings suggest that tacit understanding between humans and LLMs is measurable, while revealing the limits of profile-based conditioning for capturing deeper representational alignment.
2026-05-29T07:19:58Z
Yueshen Li, Hanyi Min, Vedant Das Swain, Koustuv Saha

http://arxiv.org/abs/2605.28916v2
First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope
2026-05-29T07:07:56Z
We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs.
However, the agents exhibited substantially different behaviors and computational costs: Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched filter inner loop. The autonomously generated manuscripts also diverged in length, details, and quality. In the second run, a subtle difference in the interpretation of the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instructions, while Codex followed the specification literally. We discuss the implications of these behavioral differences, such as speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines, for the deployment of agentic AI in scientific computing workflows.
2026-05-27T17:54:26Z
Version 2; includes the report autonomously written in PRD style by agentic AI systems as supplemental material
Gianluca Inguglia

http://arxiv.org/abs/2605.30913v1
Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits
2026-05-29T06:58:47Z
Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes.
To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.
2026-05-29T06:58:47Z
Soorya Ram Shimgekar, Agam Goyal, Amruta Parulekar, Joshua Chen, Yian Wang, Navin Kumar, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

http://arxiv.org/abs/2605.30864v1
What makes an action sequence enjoyable to watch?
2026-05-29T05:46:12Z
People often seek out ways to watch others perform complex action sequences (e.g., sports). What makes some sequences more enjoyable to watch than others? We generated 24 video clips of gameplay from a Flappy Bird-style video game. Clips varied in difficulty (how often players succeeded on average) and in moment-to-moment uncertainty (how likely the player was to crash at any given step). Participants (N=864) rated each video on one of three dimensions: how much they enjoyed it, how difficult the level appeared, or how dangerous the player's trajectory appeared. We found that participants preferred videos where the player seemed to be completing more difficult obstacle courses, but dangerousness did not predict enjoyment ratings. These findings show how procedurally generated stimuli can isolate the factors that affect how enjoyable an action sequence is to watch.
2026-05-29T05:46:12Z
6 pages, 4 figures, cogsci 2026
Jean-Peïc Chou, Kristine Zheng, Junyi Chu, Maneesh Agrawala, Judith E.
Fan

http://arxiv.org/abs/2602.10324v2
Discovering Differences in Strategic Behavior Between Humans and LLMs
2026-05-29T05:21:58Z
As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discover interpretable models of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.
2026-02-10T22:02:41Z
Accepted to ICML 2026
Caroline Wang, Daniel Kasenberg, Kim Stachenfeld, Pablo Samuel Castro

http://arxiv.org/abs/2509.13323v2
AI Behavioral Science
2026-05-29T04:50:24Z
We outline a foundation for a new field of ``AI Behavioral Science,'' covering three perspectives. First, as AI becomes ubiquitous and is increasingly proprietary and opaque, it becomes vital to develop techniques for assessing AI behavior. We outline how tools developed to assess people's behaviors by social scientists can be used to assess and infer AI's behavioral biases, tendencies, and heuristics. Second, we also discuss how AI can change the ways in which we learn about human behavior. Beyond its computational power, AI offers new techniques for simulating, inferring, and predicting human behaviors that we outline and discuss.
Third, as humans and AI are interacting in increasingly complex and intertwined systems, we need to understand the implications for the resulting economic and political outcomes. We outline issues that are increasingly pressing concerning the future of human-AI interactions and potential changes and disruptions that can ensue.
2025-08-17T11:24:31Z
Matthew O. Jackson, Qiaozhu Mei, Stephanie W. Wang, Yutong Xie, Walter Yuan, Seth Benzell, Erik Brynjolfsson, Colin F. Camerer, James Evans, Brian Jabarian, Jon Kleinberg, Juanjuan Meng, Sendhil Mullainathan, Asuman Ozdaglar, Thomas Pfeiffer, Moshe Tennenholtz, Robb Willer, Diyi Yang, Teng Ye

http://arxiv.org/abs/2605.30800v1
Computer-Aided Tagging on Wikimedia Commons: Designing for Human-AI Collaboration in Open Knowledge Work
2026-05-29T03:42:22Z
This study investigates Wikimedia Commons contributors' lived experiences with the Computer-Aided Tagging (CAT) tool, an AI-assisted image tagging system designed to improve Commons' discoverability, searchability, accessibility, and multilingual support. Using a qualitative analysis of 595 CAT-related community comments from 11 wiki pages and 16 in-depth interviews, we identify seven key issues that contributed to CAT's mixed reception and eventual deactivation. We also offer community-informed suggestions for improving the tool. We reflect on the implications for designing human-AI collaboration on Commons and for developing AI-assisted tools that support open knowledge work. This work contributes to HCI and CSCW research by extending the understanding of human-AI collaboration beyond Anglophone, text-centric, corporate platforms.
2026-05-29T03:42:22Z
Accepted to CSCW 2026, to appear
Yihan Yu, David W. McDonald

http://arxiv.org/abs/2605.30685v1
How Early Adopters Used Generative AI Worldwide: Variation by Country Income and Language
2026-05-29T00:28:36Z
AI is being used by people globally, but not everyone is using it in the same ways.
Using a large-scale dataset of anonymized, de-identified, and privacy-scrubbed interactions with a widely available and free AI chatbot, we empirically characterize differences in early adopters' usage across countries. Schooling is the most common domain of use in most countries, particularly low-income countries, with a strong inverse association evident between schooling and country-level GDP. Leisure-related use, by contrast, is positively associated with country-level income. Language, we find, also shapes use: English-language interactions are overrepresented in places where the predominant languages were not well-served by existing models during the period of the study. Improving performance across languages may be a key factor, our work suggests, in whether this technology expands digital divides or enables leapfrogging.
2026-05-29T00:28:36Z
Madeleine I. G. Daepp, Isaac Slaughter

http://arxiv.org/abs/2605.30654v1
EUDAIMONIA: Evaluating Undesirable Dynamics in AI
2026-05-28T23:17:26Z
Large language models (LLMs) are increasingly used as conversational partners for companionship, emotional disclosure, and interpersonal advice, but the social dynamics of these interactions can create harms that are not captured by capability-oriented or traditional safety evaluations. We introduce the Social AI Design Code, a framework for evaluating whether LLMs align with user welfare in social interactions, including whether they encourage harmful intimacy, dependence, or prolonged engagement. To evaluate these risks in natural and diverse user-LLM interactions, we operationalize the code with EUDAIMONIA, a benchmark of 969 user inputs and 3,147 design-requirement violation checks built from WildChat through weak-to-strong filtration, multi-model relabeling, and controlled rewriting. Evaluating 22 recent LLMs, we find that even the strongest models, Claude-Opus-4.7 and GPT-5.5, violate 30.7% and 27.2% of checks, respectively.
Extended thinking does not reduce violation rates, suggesting that these failures are persistent social-alignment problems rather than deficits solvable through test-time reasoning alone.
2026-05-28T23:17:26Z
Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast, Ravi Iyer, Robin Jia

http://arxiv.org/abs/2605.30632v1
Rationalize: Shared Semantic Reasoning for Human-AI Alignment
2026-05-28T22:34:28Z
We introduce Rationalize, a role-pair framework for shared semantic reasoning between humans and AI models in data-driven sensemaking. Building on ideas in human-machine teaming and critical thinking, we conceptualize human-AI interaction as a series of complementary role pairs (Explorer-Guide, Investigator-Informant, Teacher-Student, Judge-Advocate) operating in a shared reasoning space. In this space, human analysts and AI models (such as LLMs) make purposes, questions, assumptions, evidence, inferences, and implications explicit, facilitating alignment not only at the output level but at the level of rationalization of intent and action by each side. We relate these role pairs to the bidirectional human-AI alignment framework, illustrating how "aligning AI to humans" and "aligning humans to AI" differ by role, and sketch a collaborative research agenda for alignment design and assessment using element-level and role-specific approaches.
2026-05-28T22:34:28Z
Accepted by ACM CHI 2026 BiAlign Workshop
Aritra Dasgupta, Naga Datha Saikiran Battula, Avina Nakarmi, Sohom Sen, Subhodeep Ghosh, Xun Song

http://arxiv.org/abs/2601.11702v3
PASTA: A Scalable Framework for Multi-Policy AI Compliance Evaluation
2026-05-28T21:56:48Z
AI compliance is becoming increasingly critical as AI systems grow more powerful and pervasive. Yet the rapid expansion of AI policies creates substantial burdens for resource-constrained practitioners lacking policy expertise. Existing approaches typically address one policy at a time, making multi-policy compliance costly.
We present PASTA, a scalable compliance tool integrating four innovations: (1) a comprehensive model-card format supporting descriptive inputs across development stages; (2) a policy normalization scheme; (3) an efficient LLM-powered pairwise evaluation engine with cost-saving strategies; and (4) an interface delivering interpretable evaluations via compliance heatmaps and actionable recommendations. Expert evaluation shows PASTA's judgments closely align with human experts ($\rho \geq .626$). The system evaluates five major policies in under two minutes at approximately \$3. A user study (N = 12) confirms practitioners found outputs easy to understand and actionable, introducing a novel framework for scalable automated AI governance.
2026-01-16T18:56:39Z
28 pages, 7 figures
Yu Yang, Ig-Jae Kim, Dongwook Yoon

http://arxiv.org/abs/2602.19296v2
A Causal Framework for Estimating Heterogeneous Effects of On-Demand Tutoring
2026-05-28T18:03:51Z
This paper introduces a scalable causal inference framework for estimating the immediate, session-level effects of on-demand human tutoring embedded within adaptive learning systems. Because students seek assistance at moments of difficulty, conventional evaluation is confounded by self-selection and time-varying knowledge states. We address these challenges by integrating principled analytic sample construction with Deep Knowledge Tracing (DKT) to estimate latent mastery, followed by doubly robust estimation using Causal Forests. Applying this framework to over 5,000 middle-school mathematics tutoring sessions, we find that requesting human tutoring increases next-problem correctness by approximately 4 percentage points and accuracy on the subsequent skill encountered by approximately 3 percentage points, suggesting that the effects of tutoring have proximal transfer across knowledge components. This effect is robust to various forms of model specification and potential unmeasured confounders.
Notably, these effects exhibit significant heterogeneity across sessions and students, with session-level effect estimates ranging from $-20.25$ pp to $+19.91$ pp. Our follow-up analyses suggest that typical behavioral indicators, such as student talk time, do not consistently correlate with high-impact sessions. Furthermore, treatment effects are larger for students with lower prior mastery and slightly smaller for low-SES students. This framework offers a rigorous, practical template for the evaluation and continuous improvement of on-demand human tutoring, with direct applications for emerging AI tutoring systems.
2026-02-22T18:10:36Z
Kirk Vanacore, Danielle R Thomas, Digory Smith, Bibi Groot, Justin Reich, Rene Kizilcec

http://arxiv.org/abs/2605.30353v1
Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software
2026-05-28T17:59:59Z
Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level.
The agent resolved ten autonomously by iterating against oracle tests. Two more were resolved through the physicist's domain knowledge. The three it could not resolve, all of which evaded oracle detection, share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology.
The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here, not obviously addressed by scaling alone. [Abridged.]
2026-05-28T17:59:59Z
10 pages, 2 figures, 2 tables, 1 physicist and a few AI agents. Accepted by ICML 2026 AI for Science Workshop. Code and development log are available at this repo: https://github.com/MinhMPA/clax-pt
Nhat-Minh Nguyen