https://arxiv.org/api/JUBnssRUgifZ6rCrzxXOoYV0qgc
2026-06-14T14:09:00Z
3093439015

http://arxiv.org/abs/2605.27554v1
What Catches the Eye? A Conjoint Study of Infographic Design Preferences
2026-05-26T18:23:07Z
Infographic designers balance many choices at once: chart type, color, and whether to add a benchmark or a scale. Past work studies these factors one at a time, so we know little about how readers weigh them against each other. We address this gap with a choice-based conjoint study (N = 65) in which participants viewed pairs of infographics on a mock newspaper page about unemployment. Each infographic varied across three attributes: comparison type (none, US average, percentage scale), color (red, blue), and graphic type (single icon, icon series, bar chart). Comparison type drove most of the preference variation (58.5%), followed by graphic type (29.2%) and color (12.3%). Readers favored percentage scale markers and benchmark comparisons; color had no practical effect. The percentage scale level adds axis information rather than a benchmark, so the comparison type result mixes two distinct ideas. A single topic and a narrow palette also limit external validity. We argue that conjoint analysis is a practical and underused tool for studying visualization preferences across many design dimensions.
2026-05-26T18:23:07Z
Amit Kumar Das, Karanbir Pelia, Manav Nitesh Ukani, Klaus Mueller

http://arxiv.org/abs/2605.27546v1
Keyphrase Generative Representation of Youth Crisis Conversations Beyond Static Taxonomies
2026-05-26T18:16:29Z
Crisis Responders (CRs) rapidly assess thousands of youth SMS conversations each year to identify mental health concerns and guide support. Yet youth distress is increasingly expressed through evolving and context-specific language that often does not fit fixed-label taxonomies. This work analyzed 703,975 de-identified Kids Help Phone conversations (2018-2023) and expanded KHP's 19-label issue taxonomy into a 39-label hierarchical schema.
We then introduce Keyphrase Generative Representation (KGR), a constrained LLM generating concise, conversation-specific keyphrases, evaluated across 129 conversations and 387 expert annotations. The expanded taxonomy achieved expert consensus reliability, with an accuracy of 0.96, and expert review found that 81% of keyphrases accurately reflected content and 74% improved clarity. KGR surfaced identity-linked themes absent from the fixed taxonomy, including immigration problems and caregiver burden, and supported a topic-retrieval workflow that increased accuracy from 0.25 to 0.70 (+0.45) over the manual analyst process. KGR marks a shift toward hybrid, interpretable generative representations that extend crisis response beyond static taxonomies to surface emerging and culturally grounded patterns of youth distress.
2026-05-26T18:16:29Z
Abeer Badawi, Will Aitken, Lydia Sequeira, Jocelyn Rankin, Maia Norman, Elham Dolatabadi

http://arxiv.org/abs/2601.07085v2
The AI Cognitive Trojan Horse: How Large Language Models May Bypass Human Epistemic Vigilance
2026-05-26T17:53:46Z
Large language model (LLM)-based conversational AI systems present a challenge to human cognition that current frameworks for understanding misinformation and persuasion do not adequately address. This paper proposes that a significant epistemic risk from conversational AI may lie not in inaccuracy or intentional deception, but in something more fundamental: these systems may be configured, through optimization processes that make them useful, to present characteristics that bypass the cognitive mechanisms humans evolved to evaluate incoming information.
The Cognitive Trojan Horse hypothesis draws on Sperber and colleagues' theory of epistemic vigilance -- the parallel cognitive process monitoring communicated information for reasons to doubt -- and proposes that LLM-based systems present 'honest non-signals': genuine characteristics (fluency, helpfulness, apparent disinterest) that fail to carry the information equivalent human characteristics would carry, because in humans these are costly to produce while in LLMs they are computationally trivial. Four mechanisms of potential bypass are identified: processing fluency decoupled from understanding, trust-competence presentation without corresponding stakes, cognitive offloading that delegates evaluation itself to the AI, and optimization dynamics that systematically produce sycophancy. The framework generates testable predictions, including a counterintuitive speculation that cognitively sophisticated users may be more vulnerable to AI-mediated epistemic influence. This reframes AI safety as partly a problem of calibration -- aligning human evaluative responses with the actual epistemic status of AI-generated content -- rather than solely a problem of preventing deception.
2026-01-11T22:28:56Z
16 pages, 20 references. v2: Added brief discussion situating "honest signals" terminology in evolutionary biology (Sec. 3), with two added citations (Zahavi 1975; Maynard Smith & Harper 2003). No changes to argument or conclusions
Andrew D. Maynard

http://arxiv.org/abs/2605.27299v1
Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models
2026-05-26T17:11:21Z
Modern intrusion detection systems generate thousands of alerts daily, but alert fatigue severely limits security operations effectiveness due to too many false positives or low-impact events.
We address this by proposing a principled framework for alert prioritization based on subnormal Gaussian fuzzy numbers, explicitly modeling three sources of uncertainty: threat severity, detection confidence, and organizational risk attitude. Each alert is represented as a fuzzy number with the core indicating severity, spread indicating uncertainty, and height reflecting detection reliability. We apply ranking indices to prioritize alerts, allowing organizations to tune security posture through a risk-attitude parameter. Experimental validation on CIC-IDS2017 and NSL-KDD demonstrates greater robustness than baselines under detector degradation (0.9963 vs 0.8215 NDCGrel@100), with distinct differentiation in mid-confidence alerts and near-parity with baselines under robust detectors. The framework is theoretically grounded, computationally efficient, provides interpretable reasoning, and remains robust across detector families and miscalibration scenarios.
2026-05-26T17:11:21Z
Murat Moran

http://arxiv.org/abs/2605.27261v1
Atari Games Challenge: A Pilot Study on Multimodal Player Experience Assessment
2026-05-26T16:39:18Z
We present a pilot study on the collection and synchronisation of multimodal data for player experience investigation. We collected game telemetry, self-reported surveys, biometrics, and cued-retrospective think-aloud (C-RTA) data from 19 participants playing three Atari 2600 games. The study then uses the data to investigate difficulty in player experience (PX), showcasing a protocol for future multimodal research.
The dataset obtained from the experiment, which is publicly available, shows potential as a rich, transformative source that can be used to investigate dynamic difficulty adjustment algorithms, game balancing strategies or broader explorations of games user research. The study findings suggest that the experimental approach holds strong potential for generalisation in future player experience studies.
2026-05-26T16:39:18Z
Oleg Jarma Montoya, Erica Manca, Thomas Vase Schultz Volden, Paolo Burelli

http://arxiv.org/abs/2606.00106v1
A Methodological Framework for Explicit Control of the Speed-Accuracy Trade-off in Brain-Computer Interfaces
2026-05-26T16:13:15Z
Brain-computer interfaces (BCIs) are limited by low signal-to-noise ratio in modalities such as electroencephalography, which requires multiple trials to reliably decode user intentions. This induces a speed-accuracy trade-off, whereby higher accuracy comes at the cost of speed. The speed-accuracy balance is application-dependent, motivating controllable trade-offs. Conventional metrics, such as the Information Transfer Rate, combine speed and accuracy, obscuring their dependence and potentially introducing biases. In this study, we propose an evaluation framework independent of classifier, paradigm, and early-stopping strategy that separates speed and accuracy. We employ two measures, Gain (relative speed improvement) and Conservation (relative accuracy preservation), and combine them into a tunable Gain-Cons Balance controlled by α, regulating the speed-accuracy trade-off. The parameter adjusts the operating point without modifying the classifier, facilitating deployment across scenarios. The framework was evaluated on P300 event-related potential paradigms using public recordings from 63 subjects as well as multiple classifiers and early-stopping strategies to achieve distinct operating points in speed-accuracy and bitrate.
Results show that tuning α yields fast, accurate, or balanced BCI behaviours, demonstrating explicit control of the speed-accuracy trade-off. The method supports subject-level performance prediction and improves explainability of BCI behaviour. Further analysis of the Information Transfer Rate reveals a systematic bias toward speed, explained by the proposed framework through the Gain and Conservation measurements. Overall, this work establishes the speed-accuracy trade-off as a controllable design variable validated on public P300-based paradigms, enabling transparent evaluation and application-specific optimization of BCIs.
2026-05-26T16:13:15Z
Javier Jiménez, Francisco B Rodríguez

http://arxiv.org/abs/2510.00902v2
Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image Classification
2026-05-26T15:31:35Z
Transfer learning is crucial for medical imaging, yet the selection of source datasets often relies on researchers' intuition rather than systematic principles, which can impact the generalizability of algorithms and, thus, patient outcomes. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-computer interaction (HCI) perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data embedding), or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional "more similar is better" view. Moreover, ethical and fairness considerations remain largely absent from source dataset sections. Participants often used ambiguous terminology, which suggests a need for clearer definitions and tools to make them explicit and usable.
By clarifying these heuristics and introducing a conceptual framework of transfer learning factors, this work provides practical insights for more systematic source selection in transfer learning.
2025-10-01T13:44:46Z
Under review
Yucheng Lu, Hubert Dariusz Zając, Veronika Cheplygina, Amelia Jiménez-Sánchez

http://arxiv.org/abs/2510.10774v3
ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
2026-05-26T13:43:37Z
Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), speech-language modelling, and low-resource speech processing. We introduce ParsVoice, the largest publicly available Persian speech-text corpus tailored for training multi-speaker TTS systems, along with a scalable pipeline to construct high-quality speech-text data from long-form audiobook recordings. The pipeline combines a fine-tuned ParsBERT sentence-completion classifier, ASR-based boundary optimization, punctuation restoration, speaker identification, and a multi-dimensional quality assessment that covers both audio and Persian-specific text properties. The resulting release contains a 2,200-hour TTS-ready subset with 1.36 million aligned segments from 1,815 automatically identified speaker IDs, making it more than 25 times larger than the previously largest open Persian TTS dataset. To validate the corpus, we fine-tune XTTS, a zero-shot multilingual TTS model that operates directly on raw Persian text without phoneme representations, achieving a naturalness MOS of 3.6/5 and speaker similarity MOS of 4.0/5.
The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.
2025-10-12T19:33:11Z
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

http://arxiv.org/abs/2605.15850v2
Access Timing as Scaffolding: A Reinforcement Learning Approach to GenAI in Education
2026-05-26T11:31:22Z
In recent years, generative AI (GenAI) in educational settings has become ubiquitous in university students' daily lives, despite its potential to induce over-reliance, metacognitive disengagement, and diminished learning when used unrestrictedly. While most prior research has focused on how to pedagogically scaffold its usage, the question of when to allow off-the-shelf GenAI remains understudied and lacks pedagogically grounded empirical investigation. We treat access timing itself as a form of implicit scaffolding and operationalize it through a reinforcement learning (RL) agent that decides when students should access GenAI, with a reward function grounded in metacognitive theory, cognitive load theory, and productive failure. In a mixed-methods controlled lab study with N=105 higher education students, we compared the agent's effect on learning gains and metacognitive engagement to unrestricted and fully restricted use. Results show that strategically timed GenAI access under the reinforcement learning condition improved objective post-test performance and metacognitive accuracy compared with unrestricted access, while reducing task errors and time on task relative to complete withholding, thus outperforming both approaches without the need for explicit metacognitive prompts or structured scaffolding. However, no between-condition differences emerged on self-reported metacognitive awareness. Overall, the timing of GenAI access is therefore a tractable, theoretically grounded, and scalable pedagogical strategy that improves over both completely unrestricted and fully withheld access, is compatible with off-the-shelf tools, and potentially has a low adoption barrier.
This opens up a new research area that explores how access timing can be facilitated by educators and implemented in human-AI learning system design.
2026-05-15T11:02:16Z
Janne Rotter, Pau Benazet i Montobbio, Davinia Hernández-Leo

http://arxiv.org/abs/2605.26870v1
Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study
2026-05-26T11:28:36Z
Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when an agent is embedded persistently in a real academic research environment with durable memory, local files, external tools, scheduled routines, delegated roles, and explicit safety protocols. Methods: A structured self-observed implementation case study was conducted from January 31 to May 25, 2026. The unit of analysis was the persistent human-agent environment: researcher, agent runtime, memory layer, tools, repositories, scheduled jobs, specialized agent roles, and governance rules. Outcomes were organized using PARE-M (Persistent Agentic Research Environment Measurement), a measurement framework covering architecture, utilization, artifact production, resource use, reproducibility, and governance. Results: Recoverable main-agent telemetry contained 75,671 de-duplicated records across 96 active days, with 8,059 user-role and 23,710 assistant-role messages. The workspace included 502 memory-related files, 17 configured agent directories, and 57 skill files. Active system time was 579.7 hours (30-minute capped-gap estimate). Memory-derived records identified 482 output-proxy events and 889 failure, verification, correction, or protocol-proxy events. A strict May 2026 trajectory subset captured 627 model-completed events and 73.95 million recorded tokens, of which 82.9% were cache reads.
Conclusions: The workflow was cache-dominant, suggesting that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact. Future evaluations should use artifact-level denominators, reproducible parsing rules, correction taxonomies, and independent coding of governance events.
2026-05-26T11:28:36Z
19 pages, 2 figures, 3 main tables; supplementary appendix with 6 tables, 2 figures, and a reproducibility methods section. Describes 17 configured agents in a persistent research environment and introduces the PARE-M (Persistent Agentic Research Environment Measurement) framework
Anas H. Alzahrani

http://arxiv.org/abs/2605.26858v1
Rethinking AI Psychosis: Misnomers, Conceptual Limits, and Existential Drift
2026-05-26T11:19:08Z
There has been a proliferation of media reports about so-called AI psychosis in the last year. Not surprisingly, this has prompted growing academic work on the ways in which AI chatbots such as ChatGPT, Claude, and Replika might aggravate or even induce psychosis, typically understood in terms of users acquiring or maintaining delusional beliefs. Our paper consists of two parts. First, we provide a number of reasons to be sceptical about understanding 'AI psychosis' as a novel psychiatric category. We argue that many of the purportedly new phenomena are better understood through Stompe et al.'s (2003) metaphor of 'old wine in new bottles' and highlight conceptual, nosological, clinical, and social risks associated with the uncritical adoption of this terminology. Second, we develop a positive phenomenological account of what may nevertheless be at stake in sustained human-AI interaction. Rather than focusing primarily on whether AI systems induce, amplify, or sediment delusional beliefs, we examine how conversational AI may participate in transforming a person's lived experience of reality itself.
We claim that the sycophantic and pseudo-intersubjective nature of AI could lead to what we call "existential drift", whereby individuals may continue to feel rooted in a shared reality through their interactions with AI, while actually becoming entrenched in increasingly private and subjective worlds.
2026-05-26T11:19:08Z
Kasper Møller Nielsen, Lucy Osler

http://arxiv.org/abs/2604.11467v2
From Attribution to Action: A Human-Centered Application of Activation Steering
2026-05-26T10:03:53Z
Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections.
Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.
2026-04-13T13:41:57Z
Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin

http://arxiv.org/abs/2605.26782v1
Manipulating Tangible Virtual Object Dynamics to Promote Learning of Precision Force Generation
2026-05-26T09:51:49Z
Robotic haptic devices combined with virtual reality offer novel opportunities to train fine force generation, an essential yet overlooked component of post-stroke rehabilitation. This study proposes that manipulating the rendered dynamics of tangible virtual objects can be leveraged to train precise force control while engaging the somatosensory system. We conducted an experiment with fifty healthy participants who performed a curling-inspired task in which they had to stretch a virtual spring to generate a target release force to propel the stone to a predefined location on the ice sheet. During training, the spring's force-elongation relationship was modeled as either a linear or non-linear function, i.e., a Gaussian or antisymmetric Gaussian (AS-Gaussian) function with zero derivative at the release target force. Results indicate that the AS-Gaussian group consistently achieved higher force accuracy during training than the linear group, while the Gaussian group only outperformed the linear group toward the end of training. Analysis of personality traits revealed that higher Free Spirit scores were associated with poorer performance and reduced task exploration under Gaussian dynamics, whereas higher Transform-of-Challenge scores correlated with increased exploration. Despite these training effects, no significant differences in long-term retention were found across spring types or personality traits. Participants primarily relied on learned target elongation rather than target force, as evidenced by performance in a transfer task with a different stiffness but the same target force.
While promising for somatosensory neurorehabilitation, these methods require refinement to reduce reliance on proprioceptive cues before testing with neurological patients.
2026-05-26T09:51:49Z
Alberto Garzás-Villar, Alba Riera-Cardona, Alexis Derumigny, J. Micah Prendergast, Jane Murray Cramm, Laura Marchal-Crespo

http://arxiv.org/abs/2605.26620v1
Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering
2026-05-26T06:59:04Z
Natural language conveys information at varying levels of granularity, from fine-grained references to broad descriptions. While granularity is fundamental to human communication, existing measures mostly capture surface detail or sentence specificity. We introduce Granuscore, a reference-free measure of granularity that leverages structural properties of a hierarchical embedding space. Granuscore reliably recovers hierarchical orderings on the Granola-EQ dataset and captures expected differences in granularity across discourse contexts. Across domains, we further show that Granuscore explains non-linear variation in sentence specificity beyond sentence length. Finally, we apply Granuscore to four question-answering benchmarks and analyze how granularity differs for questions, gold answers, and model outputs across response outcomes. The analysis reveals consistent differences in model behavior and provides a principled lens for characterizing the difficulty of QA datasets. Together, the results position Granuscore as a scalable, broadly applicable tool for analyzing granularity in text.
2026-05-26T06:59:04Z
Lukas Ellinger, Alexander Fichtl, Miriam Anschütz, Georg Groh

http://arxiv.org/abs/2606.07568v1
A Systematic Study of Behavioral Cloning for Scientific Data Annotation
2026-05-26T02:19:47Z
Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the "last mile" problem: even with strong automation, verification and correction consume substantial human effort.
Standard approaches train models to directly predict annotations, discarding the rich supervision in how experts navigate, click, verify, and correct. We introduce a framework for studying behavioral cloning on scientific annotation: 9 synthetic tasks paired with synthetic annotations that simulate realistic human strategies including exploration, mistake correction, and strategic decision-making. Our experiments reveal several findings. First, skills emerge hierarchically: models learn GUI mechanics before task-critical decisions, and commit fewer mistakes than the training data while retaining the ability to correct errors when they occur. Second, scaling models on multi-task behavioral cloning shows that larger models are more data efficient within our scale range. Third, multi-task pretraining enables efficient fine-tuning to new tasks, while training from scratch fails entirely. Fourth, linear probes reveal that models internally represent latent variables of the annotation process such as task phase and data position; interestingly, we find a shared mistake representation that generalizes across different annotation tasks. Overall, our framework establishes systematic benchmarks and identifies key bottlenecks, providing a foundation for scaling behavioral cloning to real-world scientific data annotation.
2026-05-26T02:19:47Z
ICML 2026 Oral
Ishaan Singh Chandok, Core Francisco Park