http://arxiv.org/abs/2503.02857v5 Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024 2026-05-27T00:37:53Z

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.

2025-03-04T18:33:22Z Nuria Alina Chandra Hannah Lee Ryan Murtfeldt Lin Qiu Arnab Karmakar Emmanuel Tanumihardja Kevin Farhat Ben Caffee Changyeon Lee Jongwook Choi Sejin Paik Aerin Kim Oren Etzioni http://arxiv.org/abs/2506.04975v2 Evaluating Chinese Large Language Models: The Influence of Persona Assignment on Stereotypes and Safeguards 2026-05-26T23:09:48Z

Recent research has highlighted that assigning specific personas to large language models (LLMs) can significantly increase harmful content generation. However, limited attention has been given to persona-driven toxicity in non-Western contexts, particularly in Chinese-based LLMs. In this paper, we perform a large-scale, cross-model analysis of refusal behavior and persona-driven toxicity amplification across four Chinese LLMs, leveraging a comprehensive dataset of over 1,400,000 generated texts. We identify significant disparities in persona-driven refusal behavior, including systematic gender differences in refusal triggering across the evaluated Chinese LLMs. Furthermore, we provide quantitative evidence of persona-driven toxicity amplification with respect to model default baselines. We show that this amplification--whose magnitude varies substantially across models--is driven by interactions across several factors, involving persona conditioning, prompting strategy, target social group, and model-specific safety mechanisms. Leveraging model-specific regression analyses, we systematically characterize how persona categories, target social groups, and prompt templates independently and jointly shape both refusal behavior and output toxicity. As a complementary case study, we further explore an iterative, evaluator-guided mitigation strategy based on model feedback with an external LLM evaluator, demonstrating that highly toxic outputs can be substantially reduced without costly model retraining. Overall, our findings highlight the importance of culturally contextualized safety evaluations for Chinese-language LLMs and provide a structured framework for assessing persona-induced risks and exploratory mitigation strategies in LLM-generated content.

2025-06-05T12:47:21Z Geng Liu Li Feng Carlo Alberto Bono Songbo Yang Mengxiao Zhu Francesco Pierri http://arxiv.org/abs/2605.27654v1 Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability 2026-05-26T20:14:07Z

Generative translation systems are cultural technologies because they decide how socially meaningful cues are rendered within culturally specific grammatical systems. We study one concrete notion of successful cultural translation: when an English source explicitly encodes gender, an English-to-Hindi translation should preserve the recoverability of that cue unless the source itself is ambiguous. We evaluate this criterion on a 37,345-instance benchmark spanning twelve categories and show that five systems frequently erase gender through ergative and honorific constructions. We then introduce two mechanism-aware inference-time interventions. The first, the Source-Aware Reranker (SAR), prefers candidates that avoid gender-neutralizing syntax. The second, the Phenomenon-Aware Reranker (PAR), preserves gender through targeted lexical marking even when ergative syntax remains. Across GPT-4o-mini and Sarvam, PAR improves target-subset accuracy from 11.07% to 54.47% and from 15.99% to 49.66%, respectively. Human evaluation shows that PAR increases gender preservation from 10.3% to 81.3%, but reduces mean fluency from 4.36 to 3.37. These findings place the two interventions on a preservation and fluency frontier rather than supporting a single dominant solution, and show how culturally situated generation can require explicit tradeoffs among fidelity, fluency, and stylistic naturalness.

2026-05-26T20:14:07Z 10 pages, 2 figures, 9 tables Samyak Savi Chavi Gupta Shreyas Gantayet Tanay Sodha Dhruv Kumar http://arxiv.org/abs/2512.20780v3 Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles 2026-05-26T19:48:54Z

Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We analyze a dataset of math remediation dialogues in which expert tutors, novice tutors, and seven LLMs of varying sizes, comprising both open-weight and commercial models, respond to the same student errors. We examine instructional strategies and linguistic characteristics of tutoring responses, including uptake (restating and revoicing), pressing for accuracy and reasoning, lexical diversity, readability, politeness, and agency. We find that expert tutors produce higher-quality responses than novices, and that larger LLMs generally receive higher pedagogical quality ratings than smaller models, approaching expert performance on average. However, LLMs exhibit systematic differences in their instructional profiles: they underuse discursive strategies characteristic of expert tutors while generating longer, more lexically diverse, and more polite responses. Regression analyses show that pressing for accuracy and reasoning, restating and revoicing, and lexical diversity are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. These findings highlight the importance of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.

2025-12-23T21:29:09Z Ramatu Oiza Abdulsalam Segun Aroyehun http://arxiv.org/abs/2605.16293v2 From Prediction to Intervention: The Evolution of AI in Biomedicine 2026-05-26T19:18:47Z

Artificial intelligence has advanced rapidly in biomedicine through large-scale multimodal data integration, enabling increasingly accurate prediction of clinical outcomes and patient stratification. These systems, however, remain fundamentally observational: they learn statistical associations from historical data and operate within previously observed biological and clinical states, limiting their ability to generalize to novel therapies or unobserved interventions. We argue that AI in biomedicine is undergoing a structural transition. As biomedical decision-making increasingly depends on reasoning about intervention rather than extrapolation from past observations, predictive architectures become structurally insufficient. Systems that learn from historical data cannot, by construction, represent how biological systems evolve under perturbation, and therefore cannot reliably support decision-making in the presence of novel interventions. We introduce a conceptual framework distinguishing observational and interventional intelligence and define disease-level models as systems that explicitly represent the state, dynamics, and intervention response of biological processes. These models enable a shift from inference to simulation -- reasoning about what will happen under intervention rather than what is likely based on the past. This transition also implies a shift in where value is created: from data processing and prediction toward systems that support and define decision-making under intervention. It follows directly from the structure of biomedical decision-making and defines the next stage of AI in medicine. Systems that cannot model intervention will be structurally excluded from decision-making.

2026-04-14T17:49:51Z 10 pages, 3 figures, 1 table. Figures were replaced with better versions Andrew Feinberg Aleksandr Sarachakov Viktor Svekolkin Alexander Bagaev Ferran Prat Michael Feinberg http://arxiv.org/abs/2605.27371v1 Algorithmic Monocultures in Hiring 2026-05-26T17:59:55Z

Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture leads to the same individuals and members of the same racial groups facing rejection. We acquire and analyze a novel dataset of 3 million applicants submitting 4 million applications where all the applications are screened by algorithms built by the same vendor. We find clear racial disparities in applicant outcomes. Of all applications submitted by Asian and Black applicants, 14.74% and 25.87% are submitted to positions that adversely impact Asian and Black applicants, respectively, according to U.S. employment discrimination standards. Individuals also receive homogeneous outcomes: 4% of all applicants who apply to 10 positions are recommended for rejection from all positions, a rate higher than expected by chance. To better understand this homogeneity, we leverage the deterministic replicability of hiring algorithms to generate the outcomes applicants would have received if they applied to all positions. We show that applicants would need to apply widely in order to ensure their applications are considered by a human

2026-05-26T17:59:55Z Published at FAccT 2026. Website: https://algorithmichiring.github.io/ Rishi Bommasani Sarah H. Bana Kathleen A. Creel Dan Jurafsky Percy Liang http://arxiv.org/abs/2601.07085v2 The AI Cognitive Trojan Horse: How Large Language Models May Bypass Human Epistemic Vigilance 2026-05-26T17:53:46Z

Large language model (LLM)-based conversational AI systems present a challenge to human cognition that current frameworks for understanding misinformation and persuasion do not adequately address. This paper proposes that a significant epistemic risk from conversational AI may lie not in inaccuracy or intentional deception, but in something more fundamental: these systems may be configured, through optimization processes that make them useful, to present characteristics that bypass the cognitive mechanisms humans evolved to evaluate incoming information. The Cognitive Trojan Horse hypothesis draws on Sperber and colleagues' theory of epistemic vigilance -- the parallel cognitive process monitoring communicated information for reasons to doubt -- and proposes that LLM-based systems present 'honest non-signals': genuine characteristics (fluency, helpfulness, apparent disinterest) that fail to carry the information equivalent human characteristics would carry, because in humans these are costly to produce while in LLMs they are computationally trivial. Four mechanisms of potential bypass are identified: processing fluency decoupled from understanding, trust-competence presentation without corresponding stakes, cognitive offloading that delegates evaluation itself to the AI, and optimization dynamics that systematically produce sycophancy. The framework generates testable predictions, including a counterintuitive speculation that cognitively sophisticated users may be more vulnerable to AI-mediated epistemic influence. This reframes AI safety as partly a problem of calibration -- aligning human evaluative responses with the actual epistemic status of AI-generated content -- rather than solely a problem of preventing deception.

2026-01-11T22:28:56Z 16 pages, 20 references. v2: Added brief discussion situating "honest signals" terminology in evolutionary biology (Sec. 3), with two added citations (Zahavi 1975; Maynard Smith & Harper 2003). No changes to argument or conclusions Andrew D. Maynard http://arxiv.org/abs/2605.27320v1 Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding 2026-05-26T17:28:30Z

Agentic AI systems combine probabilistic reasoning with delegated action through tools, context, memory, orchestration, and external workflow integration. This note develops a formal and managerially usable model that distinguishes Agentic Technical Debt from Stochastic Tax. Agentic Technical Debt is a stock of accumulated design and governance liability. Stochastic Tax is a recurring flow of operating burden that arises when stochastic agents are used in business workflows. The two constructs are related, but they are not the same: debt can amplify the tax, while the tax can remain positive even when debt is minimized. The note starts from a compact dashboard expression, expands it into a fuller structural model, defines all variables and parameters, shows how each cost category can be estimated from operational data, and illustrates the framework with an accounts-payable simulation and companion spreadsheet.
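
The stock-versus-flow distinction can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the note's own dashboard expression or structural model: the multiplicative form `base_tax * (1 + amplification * debt)` and every parameter name are hypothetical, chosen so that debt amplifies the tax while the tax stays positive when debt is zero.

```python
def simulate(periods, base_tax, amplification, debt_added, debt_repaid, debt0=0.0):
    """Toy stock/flow model (hypothetical functional form, not the authors').

    debt is a stock carried between periods; tax is a flow paid each period.
    Returns a list of (period, debt_at_start, tax_paid) tuples.
    """
    debt = debt0
    rows = []
    for t in range(periods):
        # Flow: a positive base burden, scaled up by the current debt stock.
        tax = base_tax * (1.0 + amplification * debt)
        rows.append((t, debt, tax))
        # Stock: accumulate new liability, pay some down, never go below zero.
        debt = max(0.0, debt + debt_added - debt_repaid)
    return rows


if __name__ == "__main__":
    for period, debt, tax in simulate(3, base_tax=10.0, amplification=0.5,
                                      debt_added=2.0, debt_repaid=0.0):
        print(period, debt, tax)
```

With `debt_added` equal to `debt_repaid` and `debt0=0.0`, the debt stock stays at zero and the per-period tax equals `base_tax`, which mirrors the claim that the tax can remain positive even when debt is minimized.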

2026-05-26T17:28:30Z Muhammad Zia Hydari Raja Iqbal Narayan Ramasubbu http://arxiv.org/abs/2605.27202v1 Queue & AI: When Faster Tasks Slow Down the Workflow 2026-05-26T15:57:41Z

Quantifying the workplace productivity effects of Generative Artificial Intelligence is now central to economics, management, and public policy. The deployment of AI tools in customer service, writing, software development, and consulting operations has been reported to generate large per-task productivity gains, typically measured as tasks completed per worker-hour or reductions in mean handle time. We argue that such mean-based metrics can misrepresent AI's effects in workflows where tasks accumulate and compete for scarce human attention. AI assistance can generate a deceptive productivity signature: average completion times fall because AI tools typically supply a fast first draft, yet workflow-level performance deteriorates when a subset of AI errors escapes review and returns as costly downstream rework. We call this divergence between mean task speed and system-level delay the variance wedge. Depending on the operational parameters, the most time-efficient way to complete a workflow may undergo a transition between two task-processing regimes, a fully AI-assisted and a fully manual one. We formalize the mechanism as a queueing model and derive two main implications analytically. First, under congestion, reviewers rationally raise the risk threshold for checking AI outputs, reducing scrutiny precisely when it would matter the most. Second, AI assistance can stabilize an overloaded workflow only when (i) the fraction of tasks handled by AI exceeds a critical threshold, and (ii) the human attention required for review and expected rework is lower than the attention for manual completion, a requirement substantially more stringent than faster draft generation. These results suggest that AI deployment should be evaluated not only by average task speed, but by its overall effects on congestion, rework, and the robustness of human oversight under load.
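
The two stabilization conditions can be made concrete with a back-of-the-envelope utilization check. The sketch below is a simplification under stated assumptions and not the paper's queueing model: it treats the human as a single server, requires arrival rate times mean human time per task to stay below 1, and uses hypothetical parameter names (`review`, `rework_prob`, `rework`, `manual`).

```python
def human_time_per_task(p, review, rework_prob, rework, manual):
    """Mean human attention per task when a fraction p of tasks is AI-assisted."""
    ai_path = review + rework_prob * rework  # review plus expected rework
    return p * ai_path + (1.0 - p) * manual


def is_stable(arrival_rate, p, review, rework_prob, rework, manual):
    """Single-server stability check: utilization must stay below 1."""
    load = arrival_rate * human_time_per_task(p, review, rework_prob, rework, manual)
    return load < 1.0


def critical_ai_fraction(arrival_rate, review, rework_prob, rework, manual):
    """AI fraction that p must exceed for stability.

    Returns 0.0 if the manual workflow is already stable, and None if no
    fraction can stabilize the queue.
    """
    ai_path = review + rework_prob * rework
    if arrival_rate * manual < 1.0:
        return 0.0
    if ai_path >= manual:
        return None  # AI path needs no less human attention than manual work
    p_star = (arrival_rate * manual - 1.0) / (arrival_rate * (manual - ai_path))
    return p_star if p_star < 1.0 else None


if __name__ == "__main__":
    print(critical_ai_fraction(1.2, review=0.3, rework_prob=0.1, rework=2.0, manual=1.0))
```

In this toy version, condition (ii) corresponds to `ai_path < manual` and condition (i) to `p > p_star`. A faster first draft does not appear anywhere in the check; only the human attention spent on review and rework does.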

2026-05-26T15:57:41Z 20 pages, 6 figures Silvia Bartolucci Pierpaolo Vivo http://arxiv.org/abs/2601.04512v3 Application of Hybrid Chain Storage Framework in Energy Trading and Carbon Asset Management 2026-05-26T15:53:49Z

Distributed energy trading and carbon asset management involve high-frequency, small-value settlements with strong audit requirements. Fully on-chain designs incur excessive cost, while purely off-chain approaches lack verifiable consistency. This paper presents a hybrid on-chain and off-chain settlement framework that anchors settlement commitments and key constraints on-chain and links off-chain records through deterministic digests and replayable auditing. Experiments under publicly constrained workloads show that the framework significantly reduces on-chain execution and storage cost while preserving audit trustworthiness.
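
The idea of linking off-chain records to an on-chain commitment through deterministic digests and replayable auditing can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's protocol: the canonical-JSON encoding, the use of SHA-256, the chained batch commitment, and all function and field names are hypothetical, and the "chain" is just a string held in memory.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Deterministic SHA-256 digest of one record via canonical JSON."""
    # Sorted keys and fixed separators make the encoding independent of key order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def batch_commitment(records: list[dict]) -> str:
    """Hash the ordered record digests into a single commitment to anchor."""
    h = hashlib.sha256()
    for record in records:
        h.update(bytes.fromhex(record_digest(record)))
    return h.hexdigest()


def replay_audit(records: list[dict], anchored: str) -> bool:
    """Recompute the commitment from off-chain records and compare it."""
    return batch_commitment(records) == anchored


if __name__ == "__main__":
    trades = [
        {"buyer": "a", "seller": "b", "kwh": 1.5},
        {"buyer": "c", "seller": "a", "kwh": 2.0},
    ]
    anchored = batch_commitment(trades)   # the only value stored on-chain
    print(replay_audit(trades, anchored))  # unchanged records pass the audit
    trades[0]["kwh"] = 9.9
    print(replay_audit(trades, anchored))  # a tampered record fails it
```

Only the fixed-size commitment needs on-chain storage in this sketch, while an auditor holding the off-chain records can replay the computation and detect any modification or reordering.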

2026-01-08T02:27:34Z 13 pages, 5 figures Yinghan Hou Zongyou Yang Xiaokun Yang http://arxiv.org/abs/2510.00902v2 Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image Classification 2026-05-26T15:31:35Z

Transfer learning is crucial for medical imaging, yet the selection of source datasets often relies on researchers' intuition rather than systematic principles, which can impact the generalizability of algorithms and, thus, patient outcomes. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-computer interaction (HCI) perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data-embedding) or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional "more similar is better" view. Moreover, ethical and fairness considerations remain largely absent from source dataset sections. Participants often used ambiguous terminology, which suggests a need for clearer definitions and tools to make them explicit and usable. By clarifying these heuristics and introducing a conceptual framework of transfer learning factors, this work provides practical insights for more systematic source selection in transfer learning.

2025-10-01T13:44:46Z Under review Yucheng Lu Hubert Dariusz Zając Veronika Cheplygina Amelia Jiménez-Sánchez http://arxiv.org/abs/2605.27174v1 An investigation of AI integration in sound designer workflows and experiences 2026-05-26T15:28:51Z

Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools developers produce and the requirements of practising sound designers. This paper investigates this gap through a mixed-methods study comprising a survey of 76 practitioners and follow-up semi-structured interviews with 20 industry professionals. Results were analysed using descriptive statistical analysis and thematic analysis to identify patterns across both datasets. Five themes emerged from our analysis: Context, Workflow, Potential, Risks, and Right Use. Our work indicates that current AI tools perform adequately in fast-consumption media contexts but lack the narrative sophistication required for high-end sound design (films, immersive experiences, etc.). Practitioners demonstrate a preference for assistive, task-specific applications, particularly in audio restoration and library management, over end-to-end generative systems. This work contributes to the ongoing discussion on the use of AI and AI-enhanced tools in the creative industries. We report on the current status of the field from the point of view of sound designers and creative audio practitioners, and offer a set of recommendations for sound technologists and developers based on our findings to guide the development of more informed AI tools for sound design.

2026-05-26T15:28:51Z Nelly Garcia Joshua Reiss http://arxiv.org/abs/2605.27171v1 Faults and Pitfalls in Implementing the Right to be Forgotten 2026-05-26T15:27:49Z

The Right to be Forgotten (RTBF) is one of the oldest and most prominent legal data rights. While its legal intention is straightforward (for example, the GDPR describes it in just 417 words), the computing community has found it challenging to implement in practice. For example, regulators have issued 205 RTBF violations in the first five years of GDPR, i.e., an RTBF failure once every 9 days, on average. In this work, we identify the uncertainties and risks in supporting RTBF from a computing perspective. Then, to mitigate these challenges, we propose a two-phase approach that bridges an intrinsic dichotomy between law and computing. We demonstrate the effectiveness of our technique by showing how it could have fully avoided 80% of RTBF violations that occurred in year 6 of GDPR. We also discover six long-standing practices of computing and data management that have become anti-patterns for RTBF. Finally, to ground our research, we introduce RTBF capability into Elasticsearch, a popular open-source search engine.

2026-05-26T15:27:49Z Communications of the ACM 69(6), 2026 Chen Sun Nikolas Guggenberger Supreeth Shastri 10.1145/3807515 http://arxiv.org/abs/2605.27168v1 Grounding Text Embeddings in Stakeholder Associations 2026-05-26T15:24:15Z

Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding representations and human intentions is essential for valid analyses. We present the Stakeholder Grounding Exercise, a method for making expert associations explicit and grounding embedding model results in human understanding. In our primary case study on Danish policy issues, we find that neural text embeddings are substantially less reliable than human experts (19-26 pp gap), and that this misalignment propagates to downstream clustering performance (Spearman $\rho=0.9$ between exercise ranking and cluster quality). A secondary study on US Federal AI use cases replicates the gap (16 pp) in English, using a digital protocol and a different community of experts -- demonstrating that the gap is not an artefact of a single instrument or domain. The Stakeholder Grounding Exercise offers a practical method for assessing whether embedding models capture the semantic distinctions that matter most to domain experts.

2026-05-26T15:24:15Z Jonathan Rystrøm Sofie Burgos-Thorsen Zihao Fu Johan Irving Søltoft Kenneth C. Enevoldsen Chris Russell http://arxiv.org/abs/2510.07478v2 Fixed Points and Stochastic Meritocracies: A Long-Term Perspective 2026-05-26T15:21:44Z

We study group fairness in the context of feedback loops induced by meritocratic selection into programs that themselves confer additional advantage, like college admissions. We introduce a stylized, yet novel inter-generational model for the setting and analyze it in situations where there are no underlying differences between two populations. When the benefit of the program (or the harm of not getting into it) is completely symmetric, we show that disparities between the two populations will vanish on average in the long term, although in the short term disparities will continue to arise and dissipate cyclically. Further, the time an accumulated advantage takes to dissipate can be significant, and increases as a function of the relative importance of the program in conveying benefits. Interestingly, significant disparities can arise purely due to randomness even from completely symmetric initial conditions, especially when populations are small. The introduction of even a slight asymmetry, where the group that has accumulated an advantage becomes slightly preferred, leads to a completely different outcome. In these instances, starting from completely symmetric initial conditions, disparities between groups arise stochastically and then persist over time, yielding a permanent advantage for one group. Our analysis precisely characterizes conditions under which disparities persist or diminish, with a particular focus on the role of the scarcity of available spots in the program and its effectiveness. We also present extensive simulations in a richer model that further support our theoretical results in the simpler, stylized model. Our findings are relevant for the design and implementation of algorithmic fairness interventions in similar selection processes.

2025-10-08T19:23:57Z 45 pages, accepted to ACM FAccT 2026 Gaurab Pokharel Diptangshu Sen Sanmay Das Juba Ziani