https://arxiv.org/api/zUzBLDEl5BESzCRps0EWh/ZW5Tg
2026-06-19T03:28:51Z
2899752515

http://arxiv.org/abs/2503.02857v5
Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024
2026-05-27T00:37:53Z
In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts.
The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.
2025-03-04T18:33:22Z
Nuria Alina Chandra, Hannah Lee, Ryan Murtfeldt, Lin Qiu, Arnab Karmakar, Emmanuel Tanumihardja, Kevin Farhat, Ben Caffee, Changyeon Lee, Jongwook Choi, Sejin Paik, Aerin Kim, Oren Etzioni

http://arxiv.org/abs/2506.04975v2
Evaluating Chinese Large Language Models: The Influence of Persona Assignment on Stereotypes and Safeguards
2026-05-26T23:09:48Z
Recent research has highlighted that assigning specific personas to large language models (LLMs) can significantly increase harmful content generation. However, limited attention has been given to persona-driven toxicity in non-Western contexts, particularly in Chinese-based LLMs. In this paper, we perform a large-scale, cross-model analysis of refusal behavior and persona-driven toxicity amplification across four Chinese LLMs, leveraging a comprehensive dataset of over 1,400,000 generated texts. We identify significant disparities in persona-driven refusal behavior, including systematic gender differences in refusal triggering across the evaluated Chinese LLMs. Furthermore, we provide quantitative evidence of persona-driven toxicity amplification with respect to model default baselines. We show that this amplification--whose magnitude varies substantially across models--is driven by interactions across several factors, involving persona conditioning, prompting strategy, target social group, and model-specific safety mechanisms. Leveraging model-specific regression analyses, we systematically characterize how persona categories, target social groups, and prompt templates independently and jointly shape both refusal behavior and output toxicity. As a complementary case study, we further explore an iterative, evaluator-guided mitigation strategy based on model feedback with an external LLM evaluator, demonstrating that highly toxic outputs can be substantially reduced without costly model retraining.
Overall, our findings highlight the importance of culturally contextualized safety evaluations for Chinese-language LLMs and provide a structured framework for assessing persona-induced risks and exploratory mitigation strategies in LLM-generated content.
2025-06-05T12:47:21Z
Geng Liu, Li Feng, Carlo Alberto Bono, Songbo Yang, Mengxiao Zhu, Francesco Pierri

http://arxiv.org/abs/2605.27654v1
Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability
2026-05-26T20:14:07Z
Generative translation systems are cultural technologies because they decide how socially meaningful cues are rendered within culturally specific grammatical systems. We study one concrete notion of successful cultural translation: when an English source explicitly encodes gender, an English-to-Hindi translation should preserve the recoverability of that cue unless the source itself is ambiguous. We evaluate this criterion on a 37,345-instance benchmark spanning twelve categories and show that five systems frequently erase gender through ergative and honorific constructions. We then introduce two mechanism-aware inference-time interventions. The first, the Source-Aware Reranker (SAR), prefers candidates that avoid gender-neutralizing syntax. The second, the Phenomenon-Aware Reranker (PAR), preserves gender through targeted lexical marking even when ergative syntax remains. Across GPT-4o-mini and Sarvam, PAR improves target-subset accuracy from 11.07% to 54.47% and from 15.99% to 49.66%, respectively. Human evaluation shows that PAR increases gender preservation from 10.3% to 81.3%, but reduces mean fluency from 4.36 to 3.37.
These findings place the two interventions on a preservation and fluency frontier rather than supporting a single dominant solution, and show how culturally situated generation can require explicit tradeoffs among fidelity, fluency, and stylistic naturalness.
2026-05-26T20:14:07Z
10 pages, 2 figures, 9 tables
Samyak Savi, Chavi Gupta, Shreyas Gantayet, Tanay Sodha, Dhruv Kumar

http://arxiv.org/abs/2512.20780v3
Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles
2026-05-26T19:48:54Z
Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We analyze a dataset of math remediation dialogues in which expert tutors, novice tutors, and seven LLMs of varying sizes, comprising both open-weight and commercial models, respond to the same student errors. We examine instructional strategies and linguistic characteristics of tutoring responses, including uptake (restating and revoicing), pressing for accuracy and reasoning, lexical diversity, readability, politeness, and agency. We find that expert tutors produce higher-quality responses than novices, and that larger LLMs generally receive higher pedagogical quality ratings than smaller models, approaching expert performance on average. However, LLMs exhibit systematic differences in their instructional profiles: they underuse discursive strategies characteristic of expert tutors while generating longer, more lexically diverse, and more polite responses. Regression analyses show that pressing for accuracy and reasoning, restating and revoicing, and lexical diversity are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated.
These findings highlight the importance of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.
2025-12-23T21:29:09Z
Ramatu Oiza Abdulsalam, Segun Aroyehun

http://arxiv.org/abs/2605.16293v2
From Prediction to Intervention: The Evolution of AI in Biomedicine
2026-05-26T19:18:47Z
Artificial intelligence has advanced rapidly in biomedicine through large-scale multimodal data integration, enabling increasingly accurate prediction of clinical outcomes and patient stratification. These systems, however, remain fundamentally observational: they learn statistical associations from historical data and operate within previously observed biological and clinical states, limiting their ability to generalize to novel therapies or unobserved interventions.
We argue that AI in biomedicine is undergoing a structural transition. As biomedical decision-making increasingly depends on reasoning about intervention rather than extrapolation from past observations, predictive architectures become structurally insufficient. Systems that learn from historical data cannot, by construction, represent how biological systems evolve under perturbation, and therefore cannot reliably support decision-making in the presence of novel interventions.
We introduce a conceptual framework distinguishing observational and interventional intelligence and define disease-level models as systems that explicitly represent the state, dynamics, and intervention response of biological processes. These models enable a shift from inference to simulation -- reasoning about what will happen under intervention rather than what is likely based on the past.
This transition also implies a shift in where value is created: from data processing and prediction toward systems that support and define decision-making under intervention. It follows directly from the structure of biomedical decision-making and defines the next stage of AI in medicine. Systems that cannot model intervention will be structurally excluded from decision-making.
2026-04-14T17:49:51Z
10 pages, 3 figures, 1 table. Figures were replaced with better versions
Andrew Feinberg, Aleksandr Sarachakov, Viktor Svekolkin, Alexander Bagaev, Ferran Prat, Michael Feinberg

http://arxiv.org/abs/2605.27371v1
Algorithmic Monocultures in Hiring
2026-05-26T17:59:55Z
Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture leads to the same individuals and members of the same racial groups facing rejection. We acquire and analyze a novel dataset of 3 million applicants submitting 4 million applications where all the applications are screened by algorithms built by the same vendor. We find clear racial disparities in applicant outcomes. Of all applications submitted by Asian and Black applicants, 14.74% and 25.87% are submitted to positions that adversely impact Asian and Black applicants, respectively, according to U.S. employment discrimination standards. Individuals also receive homogeneous outcomes: 4% of all applicants who apply to 10 positions are recommended for rejection from all positions, a rate higher than expected by chance. To better understand this homogeneity, we leverage the deterministic replicability of hiring algorithms to generate the outcomes applicants would have received if they applied to all positions. We show that applicants would need to apply widely in order to ensure their applications are considered by a human.
2026-05-26T17:59:55Z
Published at FAccT 2026. Website: https://algorithmichiring.github.io/
Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel, Dan Jurafsky, Percy Liang

http://arxiv.org/abs/2601.07085v2
The AI Cognitive Trojan Horse: How Large Language Models May Bypass Human Epistemic Vigilance
2026-05-26T17:53:46Z
Large language model (LLM)-based conversational AI systems present a challenge to human cognition that current frameworks for understanding misinformation and persuasion do not adequately address. This paper proposes that a significant epistemic risk from conversational AI may lie not in inaccuracy or intentional deception, but in something more fundamental: these systems may be configured, through optimization processes that make them useful, to present characteristics that bypass the cognitive mechanisms humans evolved to evaluate incoming information. The Cognitive Trojan Horse hypothesis draws on Sperber and colleagues' theory of epistemic vigilance -- the parallel cognitive process monitoring communicated information for reasons to doubt -- and proposes that LLM-based systems present 'honest non-signals': genuine characteristics (fluency, helpfulness, apparent disinterest) that fail to carry the information equivalent human characteristics would carry, because in humans these are costly to produce while in LLMs they are computationally trivial. Four mechanisms of potential bypass are identified: processing fluency decoupled from understanding, trust-competence presentation without corresponding stakes, cognitive offloading that delegates evaluation itself to the AI, and optimization dynamics that systematically produce sycophancy. The framework generates testable predictions, including a counterintuitive speculation that cognitively sophisticated users may be more vulnerable to AI-mediated epistemic influence. This reframes AI safety as partly a problem of calibration -- aligning human evaluative responses with the actual epistemic status of AI-generated content -- rather than solely a problem of preventing deception.
2026-01-11T22:28:56Z
16 pages, 20 references.
v2: Added brief discussion situating "honest signals" terminology in evolutionary biology (Sec. 3), with two added citations (Zahavi 1975; Maynard Smith & Harper 2003). No changes to argument or conclusions
Andrew D. Maynard

http://arxiv.org/abs/2605.27320v1
Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding
2026-05-26T17:28:30Z
Agentic AI systems combine probabilistic reasoning with delegated action through tools, context, memory, orchestration, and external workflow integration. This note develops a formal and managerially usable model that distinguishes Agentic Technical Debt from Stochastic Tax. Agentic Technical Debt is a stock of accumulated design and governance liability. Stochastic Tax is a recurring flow of operating burden that arises when stochastic agents are used in business workflows. The two constructs are related, but they are not the same: debt can amplify the tax, while the tax can remain positive even when debt is minimized. The note starts from a compact dashboard expression, expands it into a fuller structural model, defines all variables and parameters, shows how each cost category can be estimated from operational data, and illustrates the framework with an accounts-payable simulation and companion spreadsheet.
2026-05-26T17:28:30Z
Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu

http://arxiv.org/abs/2605.27202v1
Queue & AI: When Faster Tasks Slow Down the Workflow
2026-05-26T15:57:41Z
Quantifying the workplace productivity effects of Generative Artificial Intelligence is now central to economics, management, and public policy. The deployment of AI tools in customer service, writing, software development, and consulting operations has been reported to generate large per-task productivity gains, typically measured as tasks completed per worker-hour or reductions in mean handle time.
We argue that such mean-based metrics can misrepresent AI's effects in workflows where tasks accumulate and compete for scarce human attention. AI assistance can generate a deceptive productivity signature: average completion times fall because AI tools typically supply a fast first draft, yet workflow-level performance deteriorates when a subset of AI errors escapes review and returns as costly downstream rework. We call this divergence between mean task speed and system-level delay the variance wedge. Depending on the operational parameters, the most time-efficient way to complete a workflow may undergo a transition between two task-processing regimes, a fully AI-assisted and a fully manual one. We formalize the mechanism as a queueing model and derive two main implications analytically. First, under congestion, reviewers rationally raise the risk threshold for checking AI outputs, reducing scrutiny precisely when it would matter the most. Second, AI assistance can stabilize an overloaded workflow only when (i) the fraction of tasks handled by AI exceeds a critical threshold, and (ii) the human attention required for review and expected rework is lower than the attention for manual completion, a requirement substantially more stringent than faster draft generation. These results suggest that AI deployment should be evaluated not only by average task speed, but by its overall effects on congestion, rework, and the robustness of human oversight under load.
2026-05-26T15:57:41Z
20 pages, 6 figures
Silvia Bartolucci, Pierpaolo Vivo

http://arxiv.org/abs/2601.04512v3
Application of Hybrid Chain Storage Framework in Energy Trading and Carbon Asset Management
2026-05-26T15:53:49Z
Distributed energy trading and carbon asset management involve high-frequency, small-value settlements with strong audit requirements. Fully on-chain designs incur excessive cost, while purely off-chain approaches lack verifiable consistency.
This paper presents a hybrid on-chain and off-chain settlement framework that anchors settlement commitments and key constraints on-chain and links off-chain records through deterministic digests and replayable auditing. Experiments under publicly constrained workloads show that the framework significantly reduces on-chain execution and storage cost while preserving audit trustworthiness.
2026-01-08T02:27:34Z
13 pages, 5 figures
Yinghan Hou, Zongyou Yang, Xiaokun Yang

http://arxiv.org/abs/2510.00902v2
Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image Classification
2026-05-26T15:31:35Z
Transfer learning is crucial for medical imaging, yet the selection of source datasets often relies on researchers' intuition rather than systematic principles, which can impact the generalizability of algorithms and, thus, patient outcomes. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-computer interaction (HCI) perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data embedding), or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional "more similar is better" view. Moreover, ethical and fairness considerations remain largely absent from source dataset sections. Participants often used ambiguous terminology, which suggests a need for clearer definitions and tools to make them explicit and usable.
By clarifying these heuristics and introducing a conceptual framework of transfer learning factors, this work provides practical insights for more systematic source selection in transfer learning.
2025-10-01T13:44:46Z
Under review
Yucheng Lu, Hubert Dariusz Zając, Veronika Cheplygina, Amelia Jiménez-Sánchez

http://arxiv.org/abs/2605.27174v1
An investigation of AI integration in sound designer workflows and experiences
2026-05-26T15:28:51Z
Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools developers produce and the requirements of practising sound designers. This paper investigates this gap through a mixed-methods study comprising a survey of 76 practitioners and follow-up semi-structured interviews with 20 industry professionals. Results were analysed using descriptive statistical analysis and thematic analysis to identify patterns across both datasets. Five themes emerged from our analysis: Context, Workflow, Potential, Risks, and Right Use. Our work indicates that current AI tools perform adequately in fast-consumption media contexts but lack the narrative sophistication required for high-end sound design (films, immersive experiences, etc.). Practitioners demonstrate a preference for assistive, task-specific applications, particularly in audio restoration and library management, over end-to-end generative systems. This work contributes to the ongoing discussion on the use of AI and AI-enhanced tools in the creative industries.
We report on the current status of the field from the point of view of sound designers and creative audio practitioners, and offer a set of recommendations for sound technologists and developers based on our findings to guide the development of more informed AI tools for sound design.
2026-05-26T15:28:51Z
Nelly Garcia, Joshua Reiss

http://arxiv.org/abs/2605.27171v1
Faults and Pitfalls in Implementing the Right to be Forgotten
2026-05-26T15:27:49Z
The Right to be Forgotten (RTBF) is one of the oldest and most prominent legal data rights. While its legal intention is straightforward (for example, the GDPR describes it in just 417 words), the computing community has found it challenging to implement in practice. For example, regulators have issued 205 RTBF violations in the first five years of GDPR, i.e., an RTBF failure once every 9 days, on average. In this work, we identify the uncertainties and risks in supporting RTBF from a computing perspective. Then, to mitigate these challenges, we propose a two-phase approach that bridges an intrinsic dichotomy between law and computing. We demonstrate the effectiveness of our technique by showing how it could have fully avoided 80% of RTBF violations that occurred in year 6 of GDPR. We also discover six long-standing practices of computing and data management that have become anti-patterns for RTBF. Finally, to ground our research, we introduce RTBF capability into Elasticsearch, a popular open-source search engine.
2026-05-26T15:27:49Z
Communications of the ACM 69(6), 2026
Chen Sun, Nikolas Guggenberger, Supreeth Shastri
10.1145/3807515

http://arxiv.org/abs/2605.27168v1
Grounding Text Embeddings in Stakeholder Associations
2026-05-26T15:24:15Z
Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding representations and human intentions is essential for valid analyses.
We present the Stakeholder Grounding Exercise, a method for making expert associations explicit and grounding embedding model results in human understanding. In our primary case study on Danish policy issues, we find that neural text embeddings are substantially less reliable than human experts (19-26 pp gap), and that this misalignment propagates to downstream clustering performance (Spearman $ρ=0.9$ between exercise ranking and cluster quality). A secondary study on US Federal AI use cases replicates the gap (16 pp) in English, using a digital protocol and a different community of experts -- demonstrating that the gap is not an artefact of a single instrument or domain. The Stakeholder Grounding Exercise offers a practical method for assessing whether embedding models capture the semantic distinctions that matter most to domain experts.
2026-05-26T15:24:15Z
Jonathan Rystrøm, Sofie Burgos-Thorsen, Zihao Fu, Johan Irving Søltoft, Kenneth C. Enevoldsen, Chris Russell

http://arxiv.org/abs/2510.07478v2
Fixed Points and Stochastic Meritocracies: A Long-Term Perspective
2026-05-26T15:21:44Z
We study group fairness in the context of feedback loops induced by meritocratic selection into programs that themselves confer additional advantage, like college admissions. We introduce a stylized, yet novel inter-generational model for the setting and analyze it in situations where there are no underlying differences between two populations. When the benefit of the program (or the harm of not getting into it) is completely symmetric, we show that disparities between the two populations will vanish on average in the long term, although in the short term disparities will continue to arise and dissipate cyclically. Further, the time an accumulated advantage takes to dissipate can be significant, and increases as a function of the relative importance of the program in conveying benefits.
Interestingly, significant disparities can arise purely due to randomness even from completely symmetric initial conditions, especially when populations are small. The introduction of even a slight asymmetry, where the group that has accumulated an advantage becomes slightly preferred, leads to a completely different outcome. In these instances, starting from completely symmetric initial conditions, disparities between groups arise stochastically and then persist over time, yielding a permanent advantage for one group. Our analysis precisely characterizes conditions under which disparities persist or diminish, with a particular focus on the role of the scarcity of available spots in the program and its effectiveness. We also present extensive simulations in a richer model that further support our theoretical results in the simpler, stylized model. Our findings are relevant for the design and implementation of algorithmic fairness interventions in similar selection processes.
2025-10-08T19:23:57Z
45 pages, accepted to ACM FAccT 2026
Gaurab Pokharel, Diptangshu Sen, Sanmay Das, Juba Ziani