https://arxiv.org/api/N5I/DalQ6Sm5ctP/ZHVgOeV3C7Y
2026-06-21T11:30:45Z
2899763015

http://arxiv.org/abs/2605.21378v2
Auditing Apple's DifferentialPrivacy.framework: Implementation Bugs, Misconfigurations, and Practical Risks
2026-05-21T16:31:48Z
Since 2016, Apple has claimed that device analytics collected to improve user experience are protected by differential privacy (DP). Apple's DifferentialPrivacy framework is deployed across its operating systems and handles sensitive signals such as Safari domains, keyboard events, photo attributes, and health-related reports. Because Apple has not open-sourced its privatization algorithms, these privacy claims have been difficult to verify independently. We present a client-side audit of Apple's DP framework on macOS Sonoma 14.2 and Sequoia 15.6. We reverse engineer the shipped binaries, recover Objective-C interfaces, build runtime harnesses that execute Apple's deployed mechanisms, and test whether their outputs match the advertised privacy guarantees. Our audit covers nearly all active deployed mechanisms, including Count Mean Sketch, Hadamard-CMS, randomized-response mechanisms, and Prio-style secure aggregation. We find multiple implementation bugs and misconfigurations. Every audited mechanism that relies on floating-point noise fails to meet its advertised DP or zero-knowledge proof guarantee, due to insecure samplers with known floating-point vulnerabilities. We also find secure-aggregation configurations with local DP disabled, exposing pre-aggregation records to any party with access to those logs. Overall, we find DP violations in 5 of 9 audited mechanisms, affecting 87% of data collection in macOS Sonoma and 68% in Sequoia. We also identify public leaked iPhone logs that can be decoded to recover private information, including Safari domains and keyboard emoji signals.
2026-05-20T16:40:02Z
19 pages, 9 figures, 1 table.
Accepted at the 47th IEEE Symposium on Security and Privacy (IEEE S&P 2026); Distinguished Paper Award
Proceedings of the 47th IEEE Symposium on Security and Privacy (IEEE S&P), 2026
Rishav Chourasia, Ergute Bao, Uzair Javaid, Xiaokui Xiao
10.1109/SP63933.2026.00225

http://arxiv.org/abs/2605.22687v1
The efficiency-gain illusion: People underestimate the rate of AI use and overestimate its benefits on simple tasks
2026-05-21T16:28:20Z
People are increasingly turning to AI assistance for simple tasks, e.g., arithmetic, spell-check, and answering simple questions. But does AI assistance actually save users time and effort? We investigate people's propensity to use AI for cognitively simple tasks and assess whether their reliance is well-calibrated. Across three pre-registered user studies (N = 2691), we find that people frequently choose to use AI even when doing so is inefficient (i.e. provides no meaningful time or effort savings). We identify systematic miscalibration at two levels: (1) a self-estimate miscalibration where people on average believe that they are using AI less than they actually are, and (2) efficiency-gain illusions where people overestimate how much time and effort savings AI use affords. We also identify a session-level carryover effect where a participant's prior AI use leads to further AI adoption and entrenches their miscalibration about time savings. Our results shed light on the mechanisms and biases underlying people's choice of whether to use AI as well as the risk of an overreliance feedback loop.
2026-05-21T16:28:20Z
Sunny Yu, Myra Cheng, Ahmad Jabbar, Ilia Sucholutsky, Katherine M. Collins, Dan Jurafsky, Robert D. Hawkins

http://arxiv.org/abs/2605.22612v1
Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions
2026-05-21T15:27:58Z
Benchmarks are necessary for healthcare evaluation, but are not sufficient for predicting deployment performance.
Our position is that the evaluation--deployment gap arises not because of poorly designed benchmarks, but from implicit assumptions about how users interact with models that cannot be surfaced from benchmarks alone. To make this precise, we propose a classification of assumptions into two categories: task, which can be tested from conversation data alone, and outcome, which requires outcome data and behavioral studies for testing. Critically, outcome assumptions depend on human behavior, something that even well-designed benchmarks cannot directly observe. To demonstrate the operationality of this framework, we retrospectively analyze a healthcare RCT as a case study and find that the gap naturally separates into task and outcome gaps of roughly equal size. To address this, we make two contributions: first, we propose BenchmarkCards, an artifact that documents assumptions, and second, we propose staged evaluation, a procedure that systematically tests assumptions and evaluates performance.
2026-05-21T15:27:58Z
13 pages, 1 figure
Naveen Raman, Santiago Cortes-Gomez, Mateo Dulce Rubio, Fei Fang, Bryan Wilder

http://arxiv.org/abs/2606.14719v1
Grand Challenges for the Convergence of Computational and Citizen Science Research Workshop Report
2026-05-21T15:19:08Z
This report is an outcome of a Computing Community Consortium (CCC) visioning workshop on Grand Challenges for the Convergence of Computational and Citizen Science Research conducted on April 8-9, 2025, in Washington, D.C. as well as through several precursor virtual input-gathering sessions. These events brought together experts across relevant disciplines to develop a research agenda that brings to fruition the above vision on how humans and machines may team up to solve some of the world's most pressing scientific problems.
Citizen science delivers measurable economic and national value. Public participation in scientific research generates millions of dollars in volunteer labor value, extends government agency capacity, and directly supports federal priorities in areas such as disaster management, public health, water, energy, workforce development, and many more. At the same time, 21st-century scientific infrastructure requirements for citizen science (from hardware and cyberinfrastructure to data and computational frameworks) mirror those for computational science more generally. The distributed, collaborative, long-term, and contextual nature of citizen science makes it a demanding real-world use case for a novel robust research infrastructure that accounts for security, privacy, resource adaptability, and transparency. In this report, we outline the key findings, future research directions, and recommendations that emerged from the April 2025 CCC Grand Challenges for the Convergence of Computational and Citizen Science Research Workshop.
2026-05-21T15:19:08Z
Lucy Fortson, Lea Shanley, Tanya Berger-Wolf, Kevin Crowston, Corey Jackson, Saiph Savage, Haley Griffin

http://arxiv.org/abs/2503.02885v3
"Would You Want an AI Tutor?" Understanding Stakeholder Perceptions of LLM-based Systems in the Classroom
2026-05-21T15:00:18Z
Large Language Models (LLMs) have gained traction in educational settings, often framed as virtual tutors or teaching assistants. Following early skepticism and bans, many schools and universities have begun integrating these systems into curricula. Yet decisions about whether and how to deploy LLM-based tools are frequently made without systematic engagement with the full range of stakeholders they affect.
In this paper, we argue that understanding stakeholder perceptions of LLM-based systems in the classroom is not a matter of measuring approval or acceptance, but of identifying whose concerns are surfaced, in which contexts, and with what implications for responsible design and governance. We introduce Contextualized Perceptions for the Adoption of LLMs in Education (Co-PALE), a stakeholder-first framework that connects educational context, responsible AI principles, and categories of perception to support more deliberate decision-making about the adoption of LLM-based tools.
We ground Co-PALE through a targeted analysis of prior work to diagnose recurring gaps in how stakeholder perceptions are studied, and through contextually distinct educational scenarios that illustrate how the same technology raises different concerns for different stakeholders. We further examine how university faculty and K--12 parents make sense of the framework through focus groups, using their reflections to surface tensions and uncertainties. Co-PALE supports more systematic reasoning about whether, where, and for whom LLM-based tools should be deployed in education.
2025-02-02T16:50:08Z
Caterina Fuligni, Daniel Dominguez Figaredo, Armanda Lewis, Julia Stoyanovich

http://arxiv.org/abs/2603.16672v2
CritiSense: Critical Digital Literacy and Resilience Against Misinformation
2026-05-21T14:21:58Z
Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics before they encounter them in the wild. We present CritiSense, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the first multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that CritiSense helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 6 months, we have reached 500+ active users.
It is freely available to all users on the Apple App Store (https://apps.apple.com/us/app/critisense/id6749675792) and Google Play Store (https://play.google.com/store/apps/details?id=com.critisense&hl=en).
2026-03-17T15:37:49Z
resilience, disinformation, misinformation, fake news, propaganda
Firoj Alam, Fatema Ahmad, Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Elisa Sartori, Giovanni Da San Martino, Abul Hasnat, Raian Ali

http://arxiv.org/abs/2604.28177v2
AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images
2026-05-21T14:15:16Z
We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection.
By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
2026-04-30T17:56:58Z
Accepted to ACL 2026 Main Conference
Bo Zhang, Tzu-Yen Ma, Zichen Tang, Junpeng Ding, Zirui Wang, Yizhuo Zhao, Peilin Gao, Zijie Xi, Zixin Ding, Haiyang Sun, Haocheng Gao, Yuan Liu, Liangjia Wang, Yiling Huang, Yujie Wang, Yuyue Zhang, Ronghui Xi, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Haihong E

http://arxiv.org/abs/2605.22391v1
Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings
2026-05-21T12:23:38Z
We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages, English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English, and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline. A 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph, 2,247 typed compound nodes across 15 categories, seed three Metapath2Vec variants that share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core blends both via injected ingredient-ingredient walks at controlled mixing, placing each model at a distinct point on the chemistry-vs-recipe-context spectrum.
2026-05-21T12:23:38Z
Jakub Radzikowski, Josef Chen

http://arxiv.org/abs/2605.18372v2
The Hidden Cost of Contextual Sycophancy: an AI Literacy Intervention in Human-AI Collaboration
2026-05-21T08:18:02Z
Large Language Models (LLMs) are increasingly used in educational settings as interactive tools for collaboration.
However, their tendency toward sycophancy, aligning with user beliefs even when incorrect, raises concerns for learning and decision-making, especially for less knowledgeable users. This study investigates how sycophantic alignment emerges in authentic multi-turn human-AI interactions and whether interventions targeting increasing AI literacy and prompting competencies can mitigate its effects. In a controlled mixed-design experiment, 60 participants completed analytical survival ranking tasks by first generating individual rankings and then making final decisions after collaborating with an AI assistant, both before and after receiving either general or sycophancy-focused prompting training. Preliminary results show that LLMs are highly sensitive to user input: lower-quality initial responses lead to poorer AI advice, suggesting that the model mirrors or incorporates user reasoning rather than correcting it or offering better alternatives that are missing or less frequent in the conversation. Critically, the propagation of user errors into AI responses significantly reduced both the quality of AI feedback and final user task performance, revealing a form of contextual sycophantic dependence. While the intervention did not eliminate the propagation of contextual errors, it significantly improved AI advice by reducing the direct mirroring of incorrect user rankings. 
These findings suggest that prompting and AI literacy alone may be insufficient to ensure epistemically independent AI support, highlighting the need for system-level approaches that better promote critical engagement in human-AI collaboration.
2026-05-18T13:20:45Z
SPRINGER AIED 2026: Accepted for LBR, poster presentation at the 27th International Conference on Artificial Intelligence in Education, 27 Jun - 3 Jul 2026, Seoul, Republic of Korea
Cansu Koyuturk, Sabrina Guidotti, Dimitri Ognibene

http://arxiv.org/abs/2605.22109v1
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
2026-05-21T07:42:47Z
Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%.
These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.
2026-05-21T07:42:47Z
Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang, Ruicong Liu, Bo Zheng, Huchuan Lu, Kaipeng Zhang, Yoichi Sato, Yifei Huang

http://arxiv.org/abs/2511.04106v5
Sub-exponential Growth Dynamics in Complex Systems: A Piecewise Power-Law Model for the Diffusion of New Words and Names
2026-05-21T07:42:33Z
The diffusion of ideas and language in society has conventionally been described by S-shaped models, such as the logistic curve. However, the role of sub-exponential growth -- a slower-than-exponential pattern known in epidemiology -- has been largely overlooked in broader social phenomena. Here, we present a piecewise power-law model to characterize complex growth curves with a few parameters. We systematically analyzed a large-scale dataset of approximately one billion Japanese blog articles linked to Wikipedia vocabulary, and observed consistent patterns in web search trend data (English, Spanish, and Japanese). Our analysis of 2,963 items, selected for reliable estimation (e.g., sufficient duration/peak, monotonic growth), reveals that 1,625 (55%) diffusion patterns without abrupt level shifts were adequately described by one or two segments. For single-segment curves, we found that (i) the mode of the shape parameter $α$ was near 0.5, indicating prevalent sub-exponential growth; (ii) the peak diffusion scale is primarily determined by the growth rate $R$, with minor contributions from $α$ or the duration $T$; and (iii) $α$ showed a tendency to vary with the nature of the topic, being smaller for niche/local topics and larger for widely shared ones. Furthermore, a micro-behavioral model of outward (stranger) vs. inward (community) contact suggests that $α$ can be interpreted as an index of the preference for outward-oriented communication.
These findings suggest that sub-exponential growth is a common pattern of social diffusion, and our model provides a practical framework for consistently describing, comparing, and interpreting complex and diverse growth curves.
2025-11-06T06:44:45Z
Physical Review E (2026)
Hayafumi Watanabe
10.1103/f3d5-2tb8

http://arxiv.org/abs/2605.21962v1
AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems
2026-05-21T03:48:31Z
Serious games are widely used for learning and training across domains such as healthcare, defense, and education. Persistent challenges remain, however, including static scenario design, authoring bottlenecks, limited learner modeling, and difficulty implementing meaningful real-time instructional adaptation. Recent advances in artificial intelligence (AI) introduce novel capabilities such as dynamic scenario variation, contextual feedback, adaptive pacing, and learner-state modeling that may help address some of these limitations. At the same time, integrating AI into serious games raises important questions related to validity, transparency, system control, and learner trust. This chapter examines how contemporary AI approaches may support real-time instructional adaptation in serious games. It distinguishes between instructional intelligence, defined as a system's capacity to infer learner knowledge and reason about pedagogically appropriate responses, and adaptivity, defined as the ability to modify instructional actions during interaction. A historical synthesis of adaptive learning systems is presented, tracing developments from early computer-assisted instruction through intelligent tutoring systems (ITS), dynamic difficulty adjustment (DDA), authoring platforms, learning analytics, and recent AI-enabled architectures.
Building on this perspective, the chapter discusses how large language models (LLMs), reinforcement learning (RL), and agent-based architectures may contribute to more integrated forms of intelligence and adaptivity in serious games. It also highlights practical and research challenges associated with AI-enabled systems, including explainability, validation, computational cost, and the limited empirical evidence regarding long-term learning outcomes in AI-enabled serious games.
2026-05-21T03:48:31Z
Book chapter, 1 figure. To appear in "Advances in Global Applied Artificial Intelligence," G. A. Tsihrintzis, M. Virvou, N. G. Bourbakis, and L. C. Jain (Eds.), Springer, Learning and Analytics in Intelligent Systems book series, 2026
Priyamvada Tripathi, Bill Kapralos

http://arxiv.org/abs/2605.21956v1
Detecting Offensive Cyber Agents: A Detection-in-Depth Approach
2026-05-21T03:44:02Z
Artificial Intelligence (AI) agents can now orchestrate cyberattacks. This development is already increasing the speed and scale of cyberattacks, decreasing attack costs, and improving the operational autonomy of cyber capabilities. To defend against these emerging threats, actors must first develop the capability to detect them. This report frames the offensive cyber agent detection challenge by outlining the coming detection gap between offensive cyber agents and traditional cyber capabilities; introducing detection-in-depth, a strategic framework to guide policymakers and defenders responding to this detection gap; and presenting five actionable detection mechanisms to support policymakers, industry, and defenders when putting this strategic framework into practice.
These include (1) Agent Identifiers for Critical Infrastructure; (2) Agent Honeypots; (3) AI-Automated Alert Analysis and Triage: systems that use AI to filter, prioritize, and interpret the growing volume of detection signals expected from autonomous cyber operations; (4) An Agentic Security Alert Standard: A reporting standard model that providers can use to communicate agentic threats, improving the speed, consistency, and actionability of reports; (5) An Agentic Cybersecurity Exchange (ACE): an institution modeled on the Global Signal Exchange that brings together model and cloud providers to detect offensive cyber agent threats at their origin point and coordinate ecosystem-wide agentic threat disruption.
2026-05-21T03:44:02Z
95 pages
Matt Mittelsteadt, Jam Kraprayoon, Robin Staes-Polet, Oskar Galeev, Jan Wehner, Christopher Covino, Shaun Ee

http://arxiv.org/abs/2606.07551v1
Astro, I'm Home! Investigating Factors that Influence the Acceptance of Home Robots Using Supervised Machine Learning
2026-05-21T00:04:09Z
The use of social robots in home environments is on the rise. This exploratory study applies regularization techniques (e.g., Lasso and Ridge regression) to investigate variables and identify new models of technology acceptance in the context of social robots. Within the original UTAUT2 framework, performance expectancy, social influence, and hedonic motivation emerged as the strongest and most consistent predictors of intention to use the technology. In addition, usability, trust, and competence were identified as promising variables in a model predicting intention to use.
2026-05-21T00:04:09Z
Preprint submitted to the 18th International Conference on Social Robotics (ICSR 2026)
Katrin Fischer, Essence Wilson, Steffie Kim, Dmitri Williams

http://arxiv.org/abs/2605.21816v1
Barriers to Evidence in AI-Related Cases and the Privatization of Proof
2026-05-20T23:33:44Z
Evidence lies at the core of litigation, but it is increasingly difficult to obtain in AI-related disputes.
Even when a claimant's position has merit, cases are often settled or dismissed because decisive facts are hidden inside proprietary models, platform logs, and protected databases. Grounding our discussion in past and ongoing cases, we investigate how asymmetries in access, resources, and expertise can create significant barriers to evidence in AI-related cases. We show how developers and deployers resist disclosure through various strategies challenging the value of the evidence to the requesting party and the cost of evidence production. From these patterns we identify seven recurring sources of asymmetry -- access to models, data, documentation, logs, expertise, compute, and infrastructure -- that reflect a broader pattern that we call the privatization of proof: when control over proof falls in the hands of private actors that can demand justification for access while ensuring that justification remains out of reach. We further argue that different types of access can be fungible: in the absence of a certain type of access (e.g., to model internals), one may be able to use alternative forms of access (e.g., sufficient compute, query access, and access to user logs) and to obtain a functionally equivalent amount of information. We propose a three-part test that can help resolve AI access disputes in litigation, drawing on concepts such as proportionality and reasonable alternatives. Our test relies on a few observations, including that the cause of action can provide a baseline for access.
2026-05-20T23:33:44Z
42 pages, 0 figures, 1 table, The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25--28, 2026, Montreal, QC, Canada
Sarah H. Cen, Hannah Ismael, Lucia Zheng
10.1145/3805689.3812219