http://arxiv.org/abs/2605.27554v1 What Catches the Eye? A Conjoint Study of Infographic Design Preferences 2026-05-26T18:23:07Z

Infographic designers balance many choices at once: chart type, color, and whether to add a benchmark or a scale. Past work studies these factors one at a time, so we know little about how readers weigh them against each other. We address this gap with a choice-based conjoint study (N = 65) in which participants viewed pairs of infographics on a mock newspaper page about unemployment. Each infographic varied across three attributes: comparison type (none, US average, percentage scale), color (red, blue), and graphic type (single icon, icon series, bar chart). Comparison type drove most of the preference variation (58.5%), followed by graphic type (29.2%) and color (12.3%). Readers favored percentage scale markers and benchmark comparisons; color had no practical effect. The percentage scale level adds axis information rather than a benchmark, so the comparison type result mixes two distinct ideas. A single topic and a narrow palette also limit external validity. We argue that conjoint analysis is a practical and underused tool for studying visualization preferences across many design dimensions.
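The importance shares quoted in this abstract (58.5%, 29.2%, 12.3%) are the kind of figure a conjoint analysis usually derives from part-worth utilities: each attribute's importance is the range of its level utilities divided by the sum of ranges across attributes. The sketch below assumes that standard definition; the abstract does not state the exact estimator, and the function name and utility values are hypothetical, chosen only for illustration.

```python
def attribute_importance(part_worths):
    """Return each attribute's relative importance in percent.

    Importance = (max - min part-worth of the attribute) / (sum of such ranges).
    """
    ranges = {
        attr: max(levels.values()) - min(levels.values())
        for attr, levels in part_worths.items()
    }
    total = sum(ranges.values())
    if total == 0:
        raise ValueError("all part-worth ranges are zero")
    return {attr: 100.0 * r / total for attr, r in ranges.items()}


# Hypothetical part-worth utilities (NOT the study's estimates).
example = {
    "comparison": {"none": -0.5, "us_average": 0.1, "percentage_scale": 0.4},
    "graphic": {"single_icon": -0.2, "icon_series": 0.0, "bar_chart": 0.2},
    "color": {"red": -0.1, "blue": 0.1},
}
importance = attribute_importance(example)
```

With these made-up utilities the ranges are 0.9, 0.4, and 0.2, so the comparison attribute receives 60% of the total importance. An attribute with more levels can show a larger range for that reason alone, which is one consideration when reading such shares.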

2026-05-26T18:23:07Z Amit Kumar Das Karanbir Pelia Manav Nitesh Ukani Klaus Mueller http://arxiv.org/abs/2605.27546v1 Keyphrase Generative Representation of Youth Crisis Conversations Beyond Static Taxonomies 2026-05-26T18:16:29Z

Crisis Responders (CRs) rapidly assess thousands of youth SMS conversations each year to identify mental health concerns and guide support. Yet youth distress is increasingly expressed through evolving and context-specific language that often does not fit fixed-label taxonomies. This work analyzed 703,975 de-identified Kids Help Phone conversations (2018-2023) and expanded KHP's 19-label issue taxonomy into a 39-label hierarchical schema. We then introduce Keyphrase Generative Representation (KGR), a constrained LLM generating concise, conversation-specific keyphrases, evaluated across 129 conversations and 387 expert annotations. The expanded taxonomy achieved expert consensus reliability, with an accuracy of 0.96, and expert review found that 81% of keyphrases accurately reflected content and 74% improved clarity. KGR surfaced identity-linked themes absent from the fixed taxonomy, including immigration problems and caregiver burden, and supported a topic-retrieval workflow that increased accuracy from 0.25 to 0.70 (+0.45) over the manual analyst process. KGR marks a shift toward hybrid, interpretable generative representations that extend crisis response beyond static taxonomies to surface emerging and culturally grounded patterns of youth distress.

2026-05-26T18:16:29Z Abeer Badawi Will Aitken Lydia Sequeira Jocelyn Rankin Maia Norman Elham Dolatabadi http://arxiv.org/abs/2601.07085v2 The AI Cognitive Trojan Horse: How Large Language Models May Bypass Human Epistemic Vigilance 2026-05-26T17:53:46Z

Large language model (LLM)-based conversational AI systems present a challenge to human cognition that current frameworks for understanding misinformation and persuasion do not adequately address. This paper proposes that a significant epistemic risk from conversational AI may lie not in inaccuracy or intentional deception, but in something more fundamental: these systems may be configured, through optimization processes that make them useful, to present characteristics that bypass the cognitive mechanisms humans evolved to evaluate incoming information. The Cognitive Trojan Horse hypothesis draws on Sperber and colleagues' theory of epistemic vigilance -- the parallel cognitive process monitoring communicated information for reasons to doubt -- and proposes that LLM-based systems present 'honest non-signals': genuine characteristics (fluency, helpfulness, apparent disinterest) that fail to carry the information equivalent human characteristics would carry, because in humans these are costly to produce while in LLMs they are computationally trivial. Four mechanisms of potential bypass are identified: processing fluency decoupled from understanding, trust-competence presentation without corresponding stakes, cognitive offloading that delegates evaluation itself to the AI, and optimization dynamics that systematically produce sycophancy. The framework generates testable predictions, including a counterintuitive speculation that cognitively sophisticated users may be more vulnerable to AI-mediated epistemic influence. This reframes AI safety as partly a problem of calibration -- aligning human evaluative responses with the actual epistemic status of AI-generated content -- rather than solely a problem of preventing deception.

2026-01-11T22:28:56Z 16 pages, 20 references. v2: Added brief discussion situating "honest signals" terminology in evolutionary biology (Sec. 3), with two added citations (Zahavi 1975; Maynard Smith & Harper 2003). No changes to argument or conclusions Andrew D. Maynard http://arxiv.org/abs/2605.27299v1 Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models 2026-05-26T17:11:21Z

Modern intrusion detection systems generate thousands of alerts daily, but alert fatigue severely limits security operations effectiveness due to too many false positives or low-impact events. We address this by proposing a principled framework for alert prioritization based on subnormal Gaussian fuzzy numbers, explicitly modeling three sources of uncertainty: threat severity, detection confidence, and organizational risk attitude. Each alert is represented as a fuzzy number with the core indicating severity, spread indicating uncertainty, and height reflecting detection reliability. We apply ranking indices to prioritize alerts, allowing organizations to tune security posture through a risk-attitude parameter. Experimental validation on CIC-IDS2017 and NSL-KDD demonstrates greater robustness than baselines under detector degradation (0.9963 vs 0.8215 NDCGrel@100), with distinct differentiation in mid-confidence alerts and near-parity with baselines under robust detectors. The framework is theoretically grounded, computationally efficient, provides interpretable reasoning, and remains robust across detector families and miscalibration scenarios.
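The robustness figures in this abstract are reported as NDCGrel@100. The abstract does not define that variant, so the sketch below shows only the textbook NDCG@k with linear gain, as a rough guide to what a ranking score of this family measures. The function names and the relevance values are assumptions for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical alert relevance scores, listed in the order a ranker produced them.
perfect = ndcg_at_k([3, 2, 1, 0], k=4)
reversed_order = ndcg_at_k([0, 1, 2, 3], k=4)
```

A ranking that places the most relevant alerts first scores 1.0, and pushing them down the list lowers the score, which is why the metric suits alert prioritization where analysts only read the top of the queue.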

2026-05-26T17:11:21Z Murat Moran http://arxiv.org/abs/2605.27261v1 Atari Games Challenge: A Pilot Study on Multimodal Player Experience Assessment 2026-05-26T16:39:18Z

We present a pilot study on the collection and synchronisation of multimodal data for player experience investigation. We collected game telemetry, self-reported surveys, biometrics, and cued-retrospective think-aloud (C-RTA) data from 19 participants playing three Atari 2600 games. The study then uses the data to investigate difficulty in PX, showcasing a protocol for future multimodal research. The dataset obtained from the experiment, which is publicly available, shows potential as a rich, transformative source that can be used to investigate dynamic difficulty adjustment algorithms, game balancing strategies or broader explorations of games user research. The study findings suggest that the experimental approach holds strong potential for generalisation in future player experience studies.

2026-05-26T16:39:18Z Oleg Jarma Montoya Erica Manca Thomas Vase Schultz Volden Paolo Burelli http://arxiv.org/abs/2606.00106v1 A Methodological Framework for Explicit Control of the Speed-Accuracy Trade-off in Brain-Computer Interfaces 2026-05-26T16:13:15Z

Brain-computer interfaces (BCIs) are limited by low signal-to-noise ratio in modalities such as electroencephalography, which requires multiple trials to reliably decode user intentions. This induces a speed-accuracy trade-off, whereby higher accuracy comes at the cost of speed. The speed-accuracy balance is application-dependent, motivating controllable trade-offs. Conventional metrics, such as the Information Transfer Rate, combine speed and accuracy, obscuring their dependence and potentially introducing biases. In this study, we propose an evaluation framework independent of classifier, paradigm, and early-stopping strategy that separates speed and accuracy. We employ two measures, Gain (relative speed improvement) and Conservation (relative accuracy preservation), and combine them into a tunable Gain-Cons Balance controlled by α, regulating the speed-accuracy trade-off. The parameter adjusts the operating point without modifying the classifier, facilitating deployment across scenarios. The framework was evaluated on P300 event-related potential paradigms using public recordings from 63 subjects as well as multiple classifiers and early-stopping strategies to achieve distinct operating points in speed-accuracy and bitrate. Results show that tuning α yields fast, accurate, or balanced BCI behaviours, demonstrating explicit control of the speed-accuracy trade-off. The method supports subject-level performance prediction and improves explainability of BCI behaviour. Further analysis of the Information Transfer Rate reveals a systematic bias toward speed, explained by the proposed framework through the Gain and Conservation measurements. Overall, this work establishes the speed-accuracy trade-off as a controllable design variable validated on public P300-based paradigms, enabling transparent evaluation and application-specific optimization of BCIs.
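For context on the Information Transfer Rate mentioned in this abstract, the sketch below implements the widely used Wolpaw formulation, in which bits per selection depend on the number of classes N and accuracy P, and are then multiplied by the selection rate. It assumes the abstract refers to this common definition. The Gain, Conservation, and Gain-Cons Balance measures are not reproduced here because the abstract does not give their formulas. Function names are illustrative.

```python
import math


def itr_bits_per_selection(n_classes, accuracy):
    """Wolpaw ITR: log2(N) + P*log2(P) + (1-P)*log2((1-P)/(N-1))."""
    if n_classes < 2:
        raise ValueError("n_classes must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(n_classes)
    if accuracy > 0.0:  # the P*log2(P) term tends to 0 as P -> 0
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:  # the (1-P) term tends to 0 as P -> 1
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1))
    return bits


def itr_bits_per_minute(n_classes, accuracy, seconds_per_selection):
    """Scale bits per selection by the number of selections per minute."""
    return itr_bits_per_selection(n_classes, accuracy) * 60.0 / seconds_per_selection
```

Because the per-minute value is a product of an accuracy term and a rate term, a single ITR number cannot show which of the two produced a change, which is the kind of coupling the abstract describes.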

2026-05-26T16:13:15Z Javier Jiménez Francisco B Rodríguez http://arxiv.org/abs/2510.00902v2 Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image Classification 2026-05-26T15:31:35Z

Transfer learning is crucial for medical imaging, yet the selection of source datasets often relies on researchers' intuition rather than systematic principles, which can impact the generalizability of algorithms and, thus, patient outcomes. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-computer interaction (HCI) perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data-embedding) or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional "more similar is better" view. Moreover, ethical and fairness considerations remain largely absent from source dataset sections. Participants often used ambiguous terminology, which suggests a need for clearer definitions and tools to make them explicit and usable. By clarifying these heuristics and introducing a conceptual framework of transfer learning factors, this work provides practical insights for more systematic source selection in transfer learning.

2025-10-01T13:44:46Z Under review Yucheng Lu Hubert Dariusz Zając Veronika Cheplygina Amelia Jiménez-Sánchez http://arxiv.org/abs/2510.10774v3 ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis 2026-05-26T13:43:37Z

Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), speech-language modelling, and low-resource speech processing. We introduce ParsVoice, the largest publicly available Persian speech-text corpus tailored for training multi-speaker TTS systems, along with a scalable pipeline to construct high-quality speech-text data from long-form audiobook recordings. The pipeline combines a fine-tuned ParsBERT sentence-completion classifier, ASR-based boundary optimization, punctuation restoration, speaker identification, and a multi-dimensional quality assessment that covers both audio and Persian-specific text properties. The resulting release contains a 2,200-hour TTS-ready subset with 1.36 million aligned segments from 1,815 automatically identified speaker IDs, making it more than 25 times larger than the previously largest open Persian TTS dataset. To validate the corpus, we fine-tune XTTS, a zero-shot multilingual TTS model that operates directly on raw Persian text without phoneme representations, achieving a naturalness MOS of 3.6/5 and speaker similarity MOS of 4.0/5. The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.

2025-10-12T19:33:11Z Mohammad Javad Ranjbar Kalahroodi Heshaam Faili Azadeh Shakery http://arxiv.org/abs/2605.15850v2 Access Timing as Scaffolding: A Reinforcement Learning Approach to GenAI in Education 2026-05-26T11:31:22Z

In recent years, generative AI (GenAI) in educational settings has become ubiquitous in university students' daily lives, despite its potential to induce over-reliance, metacognitive disengagement, and diminished learning when used unrestrictedly. While most prior research has focused on how to pedagogically scaffold its usage, the question of when to allow off-the-shelf GenAI remains understudied and lacks pedagogically grounded empirical investigation. We treat access timing itself as a form of implicit scaffolding and operationalize it through a reinforcement learning (RL) agent that decides when students should access GenAI, with a reward function grounded in metacognitive theory, cognitive load theory, and productive failure. In a mixed-methods controlled lab study with N=105 higher education students, we compared the agent's effect on learning gains and metacognitive engagement to unrestricted and fully restricted use. Results show that strategically timed GenAI access under the reinforcement learning condition improved objective post-test performance and metacognitive accuracy compared with unrestricted access, while reducing task errors and time on task relative to complete withholding, thus outperforming both approaches without the need for explicit metacognitive prompts or structured scaffolding. However, no between-condition differences emerged on self-reported metacognitive awareness. Overall, the timing of GenAI access is therefore a tractable, theoretically grounded, and scalable pedagogical strategy that improves on both completely unrestricted and fully withheld access, is compatible with off-the-shelf tools, and potentially has a low adoption barrier. This opens up a new research area that explores how access timing can be facilitated by educators and implemented in human-AI learning system design.

2026-05-15T11:02:16Z Janne Rotter Pau Benazet i Montobbio Davinia Hernández-Leo http://arxiv.org/abs/2605.26870v1 Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study 2026-05-26T11:28:36Z

Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when an agent is embedded persistently in a real academic research environment with durable memory, local files, external tools, scheduled routines, delegated roles, and explicit safety protocols. Methods: A structured self-observed implementation case study was conducted from January 31 to May 25, 2026. The unit of analysis was the persistent human-agent environment: researcher, agent runtime, memory layer, tools, repositories, scheduled jobs, specialized agent roles, and governance rules. Outcomes were organized using PARE-M (Persistent Agentic Research Environment Measurement), a measurement framework covering architecture, utilization, artifact production, resource use, reproducibility, and governance. Results: Recoverable main-agent telemetry contained 75,671 de-duplicated records across 96 active days, with 8,059 user-role and 23,710 assistant-role messages. The workspace included 502 memory-related files, 17 configured agent directories, and 57 skill files. Active system time was 579.7 hours (30-minute capped-gap estimate). Memory-derived records identified 482 output-proxy events and 889 failure, verification, correction, or protocol-proxy events. A strict May 2026 trajectory subset captured 627 model-completed events and 73.95 million recorded tokens, of which 82.9% were cache reads. Conclusions: The workflow was cache-dominant, suggesting that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact. Future evaluations should use artifact-level denominators, reproducible parsing rules, correction taxonomies, and independent coding of governance events.
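The abstract reports active system time as a "30-minute capped-gap estimate" without spelling out the rule. One common reading, assumed here, is to sum the gaps between consecutive event timestamps while counting any gap longer than the cap as exactly the cap. The function name and the sample timestamps below are hypothetical.

```python
from datetime import datetime, timedelta


def active_hours(timestamps, cap=timedelta(minutes=30)):
    """Sum gaps between consecutive events, counting each gap as at most `cap`."""
    ordered = sorted(timestamps)
    total = timedelta()
    for prev, curr in zip(ordered, ordered[1:]):
        total += min(curr - prev, cap)
    return total.total_seconds() / 3600.0


# Hypothetical event log: a 10-minute gap, then a 2-hour idle period.
events = [
    datetime(2026, 5, 1, 9, 0),
    datetime(2026, 5, 1, 9, 10),
    datetime(2026, 5, 1, 11, 10),
]
hours = active_hours(events)  # 10 min + 30 min (capped) = 40 min
```

An alternative rule would drop long gaps entirely instead of capping them, which yields a lower estimate; the abstract does not say which was used, so the choice above is only one plausible interpretation.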

2026-05-26T11:28:36Z 19 pages, 2 figures, 3 main tables; supplementary appendix with 6 tables, 2 figures, and a reproducibility methods section. Describes 17 configured agents in a persistent research environment and introduces the PARE-M (Persistent Agentic Research Environment Measurement) framework Anas H. Alzahrani http://arxiv.org/abs/2605.26858v1 Rethinking AI Psychosis: Misnomers, Conceptual Limits, and Existential Drift 2026-05-26T11:19:08Z

There has been a proliferation of media reports about so-called AI psychosis in the last year. Not surprisingly, this has prompted growing academic work on the ways in which AI chatbots such as ChatGPT, Claude, and Replika might aggravate or even induce psychosis, typically understood in terms of users acquiring or maintaining delusional beliefs. Our paper consists of two parts. First, we provide a number of reasons to be sceptical about understanding 'AI psychosis' as a novel psychiatric category. We argue that many of the purportedly new phenomena are better understood through Stompe et al.'s (2003) metaphor of 'old wine in new bottles' and highlight conceptual, nosological, clinical, and social risks associated with the uncritical adoption of this terminology. Second, we develop a positive phenomenological account of what may nevertheless be at stake in sustained human-AI interaction. Rather than focusing primarily on whether AI systems induce, amplify, or sediment delusional beliefs, we examine how conversational AI may participate in transforming a person's lived experience of reality itself. We claim that the sycophantic and pseudo-intersubjective nature of AI could lead to what we call "existential drift", whereby individuals may continue to feel rooted in a shared reality through their interactions with AI, while actually becoming entrenched in increasingly private and subjective worlds.

2026-05-26T11:19:08Z Kasper Møller Nielsen Lucy Osler http://arxiv.org/abs/2604.11467v2 From Attribution to Action: A Human-Centered Application of Activation Steering 2026-05-26T10:03:53Z

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.

2026-04-13T13:41:57Z Tobias Labarta Maximilian Dreyer Katharina Weitz Wojciech Samek Sebastian Lapuschkin http://arxiv.org/abs/2605.26782v1 Manipulating Tangible Virtual Object Dynamics to Promote Learning of Precision Force Generation 2026-05-26T09:51:49Z

Robotic haptic devices combined with virtual reality offer novel opportunities to train fine force generation, an essential yet overlooked component of post-stroke rehabilitation. This study proposes that manipulating the rendered dynamics of tangible virtual objects can be leveraged to train precise force control while engaging the somatosensory system. We conducted an experiment with fifty healthy participants who performed a curling-inspired task in which they had to stretch a virtual spring to generate a target release force to propel the stone to a predefined location on the ice sheet. During training, the spring's force-elongation relationship was modeled as either a linear or non-linear function, i.e., a Gaussian or antisymmetric Gaussian (AS-Gaussian) function with zero derivative at the release target force. Results indicate that the AS-Gaussian group consistently achieved higher force accuracy during training than the linear group, while the Gaussian group only outperformed the linear group toward the end of training. Analysis of personality traits revealed that higher Free Spirit scores were associated with poorer performance and reduced task exploration under Gaussian dynamics, whereas higher Transform-of-Challenge scores correlated with increased exploration. Despite these training effects, no significant differences in long-term retention were found across spring types or personality traits. Participants primarily relied on learned target elongation rather than target force, as evidenced by performance in a transfer task with a different stiffness but the same target force. While promising for somatosensory neurorehabilitation, these methods require refinement to reduce reliance on proprioceptive cues before testing with neurological patients.

2026-05-26T09:51:49Z Alberto Garzás-Villar Alba Riera-Cardona Alexis Derumigny J. Micah Prendergast Jane Murray Cramm Laura Marchal-Crespo http://arxiv.org/abs/2605.26620v1 Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering 2026-05-26T06:59:04Z

Natural language conveys information at varying levels of granularity, from fine-grained references to broad descriptions. While granularity is fundamental to human communication, existing measures mostly capture surface detail or sentence specificity. We introduce Granuscore, a reference-free measure of granularity that leverages structural properties of a hierarchical embedding space. Granuscore reliably recovers hierarchical orderings on the Granola-EQ dataset and captures expected differences in granularity across discourse contexts. Across domains, we further show that Granuscore explains non-linear variation in sentence specificity beyond sentence length. Finally, we apply Granuscore to four question-answering benchmarks and analyze how granularity differs for questions, gold answers, and model outputs across response outcomes. The analysis reveals consistent differences in model behavior and provides a principled lens for characterizing the difficulty of QA datasets. Together, the results position Granuscore as a scalable, broadly applicable tool for analyzing granularity in text.

2026-05-26T06:59:04Z Lukas Ellinger Alexander Fichtl Miriam Anschütz Georg Groh http://arxiv.org/abs/2606.07568v1 A Systematic Study of Behavioral Cloning for Scientific Data Annotation 2026-05-26T02:19:47Z

Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the "last mile" problem: even with strong automation, verification and correction consume substantial human effort. Standard approaches train models to directly predict annotations, discarding the rich supervision in how experts navigate, click, verify, and correct. We introduce a framework for studying behavioral cloning on scientific annotation: 9 synthetic tasks paired with synthetic annotations that simulate realistic human strategies including exploration, mistake correction, and strategic decision-making. Our experiments reveal several findings. First, skills emerge hierarchically: models learn GUI mechanics before task-critical decisions, and commit fewer mistakes than the training data while retaining the ability to correct errors when they occur. Second, scaling models on multi-task behavioral cloning shows that larger models are more data efficient within our scale range. Third, multi-task pretraining enables efficient fine-tuning to new tasks, while training from scratch fails entirely. Fourth, linear probes reveal that models internally represent latent variables of the annotation process such as task phase and data position; interestingly, we find a shared mistake representation that generalizes across different annotation tasks. Overall, our framework establishes systematic benchmarks and identifies key bottlenecks, providing a foundation for scaling behavioral cloning to real-world scientific data annotation.

2026-05-26T02:19:47Z ICML 2026 Oral Ishaan Singh Chandok Core Francisco Park