https://arxiv.org/api/lhAXdxzLCd6xYUXW23IhDDxObfY 2026-06-13T12:48:22Z 288864515
http://arxiv.org/abs/2601.12164v3The Language You Ask In: Language-Conditioned Ideological Divergence in LLM Analysis of Contested Political Documents2026-06-09T21:18:31ZLarge language models (LLMs) are increasingly deployed as analytical tools across multilingual contexts, yet their outputs may carry systematic biases conditioned by the language of the prompt. This study presents an experimental comparison of LLM-generated political analyses of a Ukrainian civil society document, using semantically equivalent prompts in Russian and Ukrainian administered to two frontier models from different developers, ChatGPT 5.2 and Claude Opus 4.5. Despite identical source material and parallel query structures, both models diverged along the same axis: Russian-language outputs leaned toward delegitimizing framings, characterizing civil society actors as externally funded elites constraining a democratic mandate, while Ukrainian-language outputs treated the same actors as legitimate stakeholders in democratic contestation. The magnitude of this divergence, however, was model-dependent. ChatGPT's Russian output reproduced vocabulary characteristic of Russian state discourse; Claude Opus's stayed in a mainstream critical idiom and hedged its judgments in both languages. These findings demonstrate that prompt language alone can systematically shift the ideological orientation of an unchanged model analyzing identical content. The shift is a general property of multilingual LLMs whose severity, and whose alignment with propaganda narratives, varies across systems. 
The implications reach AI deployment in polarized information environments, cross-lingual research, and AI governance in multilingual societies.2026-01-17T21:00:36ZOleg Smirnovhttp://arxiv.org/abs/2606.11456v1AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable2026-06-09T21:16:31ZThe deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. 
In our setting, the locus of AI bias is not estimation but interpretation.2026-06-09T21:16:31ZMeysam AlizadehFabrizio GilardiMohsen MoslehEnkelejda Kasnecihttp://arxiv.org/abs/2501.16531v2Guardrails versus Gatekeepers: Understanding Product Managers' Ethical Decision-Making in Generative AI2026-06-09T21:00:23ZWhat is the role of product managers in the responsible use of generative AI (genAI) in products and everyday work -- and what enables or constrains their ability to take action? Past literature has examined the ways in which organizational policies can become decoupled from practices when incentives for responsible action are misaligned or impeded by profit motives. While the role of engineers and professional ethicists in the context of AI has been examined in detail, the role of product managers -- who are frequently portrayed as "gatekeepers" or critical decision-makers in product teams -- remains unclear. In this paper, we examine what organizational conditions promote responsible use of genAI by product managers by drawing on twenty-five interviews and a global survey of over three hundred respondents in product management-related roles. We find that uncertainty around responsible AI and a sense of diffused responsibility constrain ethical action, while leadership commitment and organizational principles enable ethical action -- making some responsible practices up to fourteen times more likely. Further, we find two sets of actions product managers take to "recouple" ethical commitments and practices. The first includes low-resource, individual actions product managers can implement without explicit organizational incentives. The second includes high-resource, collective actions that require organizational incentives. Our research suggests recoupling ethical policies and practices at the level of product teams requires institutional buy-in and higher level leadership commitment. 
Nevertheless, we show that individual actors are able to exhibit agency through some meaningful, low resource actions, even in the absence of organizational incentives, though this alone is insufficient to operationalize responsible AI at scale.2025-01-27T22:10:27ZTo appear in the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)Genevieve SmithNatalia LukaMerrick OsborneBrian LattimoreJessica NewmanBrent MittelstadtBrandie Nonneckehttp://arxiv.org/abs/2510.18289v3Food4All: An Agentic Framework and Benchmark for Food Resource Navigation with Adaptive User Understanding2026-06-09T20:31:17ZFood assistance referral requires conversational agents to translate underspecified, often noisy help-seeking dialogues into locally valid resource recommendations. We present Food4All, an agentic food-resource referral framework and benchmark grounded in 686 structured Indiana food resources. Food4All couples a food-specific search tool with 300 multi-turn evaluation tasks spanning single food needs, composite cases with access or document constraints, and five non-ideal user interaction traits: unreasonable demands, rambling responses, impatience, incomplete answers, and inconsistent information. We evaluate six Large Language Models (LLMs) on requirement grounding, resource retrieval, final referral correctness, and interaction efficiency. Although the strongest model achieves 96.33% referral accuracy, our diagnostics reveal persistent failures in grounding schedule, eligibility, intake, and document constraints, as well as failures to preserve valid retrieved resources in the final recommendation. Trait-level analysis further shows that different non-ideal behaviors stress different parts of the referral pipeline. 
Food4All provides a controlled testbed for studying tool-calling agents in constraint-sensitive food assistance referral under realistic user interaction challenges.2025-10-21T04:35:02ZWe have further refined the benchmark construction and experimental presentation to improve clarity and consistency. The revised version includes updated task design, food-resource data, and evaluation details to better align the benchmark with the intended food resource referral setting. These changes provide a more precise presentation of the experimental findingsYiyang LiWeixiang SunTianyi MaKaiwen ShiZheyuan ZhangYanfang Yehttp://arxiv.org/abs/2606.11337v1Can AI Agents Synthesize Scientific Conclusions?2026-06-09T18:16:04ZScientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. 
Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.2026-06-09T18:16:04Z79 pages, 34 figures, 17 tables. Under SubmissionHayoung JungPedro Viana DinizJosé Reinaldo Corrêa RovedaAbner Fernandes da SilvaHaeun JungEnoch TsaiAleksandra KorolovaManoel Horta Ribeirohttp://arxiv.org/abs/2606.11176v1Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories2026-06-09T17:51:55ZData tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. 
We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at https://data2story.github.io.2026-06-09T17:51:55ZProject page: https://data2story.github.io Github: https://github.com/QinghongLin/data2story-skillKevin Qinghong LinBatu EIYuhong ShiPan LuPhilip TorrJames Zouhttp://arxiv.org/abs/2606.11150v1ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity2026-06-09T17:35:37ZLarge language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previously required experienced human biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they also shift the landscape of biosecurity risks. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of tasks to measure agentic biosecurity-relevant capabilities. ABC-Bench evaluates LLM agents on both benign and dual-use biology tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. 
These tasks require a combination of biology and software expertise. All tested LLM agents outperformed the median expert human baseliner on all three tasks. Agents performed highly on tasks drawing on published knowledge and well-documented protocols, and more weakly on a task requiring novel bioinformatics reasoning. In three wet-lab validation experiments, we found that OpenAI's o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences.2026-06-09T17:35:37Z18 pages. To be published in ICML 2026Andrew Bo LiuSamira NedungadiBryce CaiAlex KleinmanHarmon BhasinSeth Donoughehttp://arxiv.org/abs/2412.05163v5Support for AI Development -- Automated Daily Measurement with Open Data and Code2026-06-09T17:18:31ZThis manuscript presents and advocates for a new form of scientific communication: free and open nowcasting of public opinion via web dashboard. I present an open-source automated system that gathers new human responses to survey items daily, anonymizes and publicly distributes microdata, and presents analyses through a publicly viewable Web dashboard. A demonstration implementation tracked support for further development of artificial intelligence among American adults. As of 2026-05-31, the system had autonomously produced 766 daily estimates of support from N=8551 respondents. The findings underscore the need for continuous, high-frequency surveys to accurately track shifts in public opinion on transformative technologies like AI. I argue that more scientists should adopt the method of open nowcasting, because it encourages transparency in research design and eases replication.2024-12-06T16:27:05ZJason Jeffrey Joneshttp://arxiv.org/abs/2606.11116v1Designed by Journalists, but Is It for Readers? 
Rethinking AI Disclosures and Transparency in News2026-06-09T17:13:40ZAs newsrooms integrate generative AI, journalists face a disclosure challenge: how to communicate AI involvement in ways that maintain reader trust. Current practice offers two approaches: brief one-line labels or detailed disclosures specifying human oversight, editorial accountability, and error reporting mechanisms. Neither achieves journalists' goal of building trust through transparency. An existing controlled experiment with 34 news readers shows that detailed disclosures trigger a \textit{transparency dilemma}, reducing trust rather than increasing it, and risk introducing dark patterns that readers scroll past with the illusion of transparency. One-line disclosures avoid this effect but can create an information gap, prompting readers to expend cognitive effort searching for signs of AI involvement that the disclosure indicates but does not explain. Yet readers are not rejecting transparency; they proposed disclosure designs centered on user agency: detail-on-demand interactions, proportional AI-ratio visualizations, outlet-level signals, and explicit "no AI" labels. I argue that this disconnect between what practitioners believe is responsible disclosure and what users actually need is a design problem for the HCI community.2026-06-09T17:13:40ZAccepted to CHIWORK Workshop (Interrogating GenAI Augmentation for CHIworkers: Strategies for Professional Autonomy and Accountability)Pooja Prajodhttp://arxiv.org/abs/2603.04689v4Generalizing Fair Top-$k$ Selection: An Integrative Approach2026-06-09T17:00:32ZFair top-$k$ selection, which ensures appropriate proportional representation of members from minority or historically disadvantaged groups among the top-$k$ selected candidates, has drawn significant attention. We study the problem of finding a fair (linear) scoring function with multiple protected groups while also minimizing the disparity from a reference scoring function. 
This generalizes the prior setup, which was restricted to the single-group setting without disparity minimization. Previous studies imply that the number of protected groups may have a limited impact on the runtime efficiency. However, driven by the need for experimental exploration, we find that this implication overlooks a critical issue that may affect the fairness of the outcome. Once this issue is properly considered, our hardness analysis shows that the problem may become computationally intractable even for a two-dimensional dataset and small values of $k$. However, our analysis also reveals a gap in the hardness barrier, enabling us to recover the efficiency for the case of small $k$ when the number of protected groups is sufficiently small. Furthermore, beyond measuring disparity as the "distance" between the fair and the reference scoring functions, we introduce an alternative disparity measure$\unicode{x2014}$utility loss$\unicode{x2014}$that may yield a more stable scoring function under small weight perturbations. Through careful engineering trade-offs that balance implementation complexity, robustness, and performance, our augmented two-pronged solution demonstrates strong empirical performance on real-world datasets, with experimental observations also informing algorithm design and implementation decisions.2026-03-05T00:06:47ZGuangya Caihttp://arxiv.org/abs/2606.11082v1The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models2026-06-09T16:42:00ZThis study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis, a synthetic maritime territorial dispute designed to mirror the structural dynamics of Eastern Mediterranean conflicts. 
Six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, and DeepSeek-R1) participate in a between-groups experiment (N = 10 games per arm, K = 5 rounds per game) in which the sole manipulation is the language of play (English versus Turkish), producing 586 validated statements. A zero-shot classifier assesses behavioral dispositions along two continuous dimensions: Concession Rate and Coercive Rhetoric. The results are heterogeneous. Llama-4 shows a substantial, Holm-corrected increase in coercive rhetoric under Turkish (delta = +0.800, p = .002), whereas Gemini-3.1-Pro displays an equally large decrease (delta = -0.750, p = .005). DeepSeek-R1 exhibits a similar negative shift (delta = -0.860, p = .006) and provides chain-of-thought evidence consistent with a buffering mechanism. GPT-4o shows no detectable effect (delta = +0.130, p = .614). These findings indicate that cross-lingual behavioral skew is contingent on model architecture and training regime rather than a universal property of Western-origin LLMs. We identify two distinct buffering mechanisms, chain-of-thought institutional anchoring and multilingual RLHF alignment, and discuss their implications for integrating LLMs safely into diplomatic and crisis-management settings.2026-06-09T16:42:00Z25 pages, 2 figures, 6 tables, Research ArticleHakan Mehmetcikhttp://arxiv.org/abs/2606.11040v1Internet Quality Barometer (IQB): A preliminary data-driven evaluation of the IQB framework2026-06-09T16:09:44ZThe Internet Quality Barometer (IQB) framework was designed to transform raw Internet measurement data into actionable insights about Internet quality. Specifically, the framework maps raw speed test measurements to network requirements (e.g., throughput, latency), maps these requirements to representative Internet use cases (such as video streaming or web browsing), and finally aggregates performance across use cases into a single IQB score. 
The IQB score is a composite index ranging from 0 to 1, intended to capture overall Internet quality in a way that is both interpretable and comparable across locations. We implemented the IQB framework in practice by developing an open-source IQB library and a prototype web application. These tools enabled us to compute IQB scores at scale, including global estimates aggregated at the level of countries, regions, and cities. In this report we conduct a preliminary sensitivity analysis of the IQB framework, investigating how different parameter choices affect the resulting IQB scores, identifying which parameters the framework is most sensitive to, and highlighting cases that may lead to outliers or potentially misleading results.2026-06-09T16:09:44ZPavlos SermpezisZeynep Arslanhttp://arxiv.org/abs/2601.05232v3AI Application Gives Users Real-Time Feedback on the Level of Peace in the Social Media Videos They Watch2026-06-09T16:07:38ZMost people now get their news from videos on social media, such as YouTube and Facebook, rather than through curated journalism. "We become what we behold." The content and tone of language play an essential role in starting or ending conflicts. "Hate Speech" can enhance conflict; "Peace Speech" can enhance peace. We developed an application that measures, in real time, these aspects of speech from YouTube videos, which can give users helpful feedback on their own media diet. We used two approaches: 1) supervised machine learning. Language in online news media text was tagged by surveys that measure the level of peace in those countries. One fully connected feedforward and 2 convolutional neural networks trained on that data were $\sim 97\%$ accurate in predicting levels of peace in the test set and $\sim 70\%$ accurate in another distinct news text data set, but did not generalize to YouTube videos, suggesting that written text differs from transcribed spoken language. 2) social science dimensions. 
There is no similar external data to tag the text in the YouTube video transcripts. We therefore used 2 word-level sentiment analysis (SA) and 6 context-level large language models (LLMs) to measure 5 social dimensions in peace identified by 59 social science studies: compassion-contempt, news-opinion, promotion-prevention, creativity-order, nuance-simplification. LLMs more closely matched the values by 3 human coders on 52 videos, $r^2\sim0.60$ than SA, at $r^2\sim0.03$. Results: LLMs successfully measured social dimensions important in peace in YouTube videos, compared to human coders. These results serve as the basis of an analysis engine that can give users and content creators feedback on their own media diet and creations.2026-01-08T18:57:01Z6 pages, 4 figures, corrected typos, minor edits; v3: 16 pages, improved title, abstract, introduction, discussion, conclusions, added more referencesP. GildaColumbia UniversityP. DungarwalColumbia UniversityA. ThongkhamColumbia UniversityE. T. AjayiSt John's UniversityS. ChoudharyColumbia UniversityT. M. TerolColumbia UniversityC. LamColumbia UniversityJ. P. AraujoColumbia UniversityM. McFadyen-MungallnColumbia UniversityL. S. LiebovitchColumbia UniversityP. T. ColemanColumbia UniversityH. WestColumbia UniversityK. SieckToyota Research InstituteS. CarterToyota Research Institutehttp://arxiv.org/abs/2606.11009v1Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions2026-06-09T15:50:12ZLarge language models are increasingly used to adapt math word problems for personalized learning at scale, but it remains an open question whether those adaptations are consistent across models, preserve cultural diversity at scale, and reveal which cultural entities models treat as most salient. 
We analyze how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into Bengali, Hindi, Punjabi (India), Urdu, Sindhi (Pakistan), Italian, and Sicilian (Italy), a language set spanning the full resource spectrum, from high-resource Italian and Hindi to under-studied Sindhi, Sicilian, and Punjabi. We annotate 6,489 entity transformations, coding whether models preserve, localize, generalize, omit, or change entities such as names, foods, and places. Models agree on transformation type in 62.5% of cases and on specific substitutions in only 33.5%, meaning model choice directly shapes which cultural world students encounter. All 21 language-model combinations show entropy collapse, with adaptation compressing rather than expanding cultural diversity. Models prioritize surface markers such as names, foods, and currencies while preserving deeper structural features such as grade-level systems that embed culturally specific assumptions. Despite prompts specifying target countries, models misattribute regional context by using Bangladeshi taka for Indian Bengali students and produce cross-cultural contamination, such as adapting egg hunts as Eid activities. Some failures are visible in individual translations. Others, including diversity collapse, systematic preference for surface markers, and consistent regional misattribution, emerge only through corpus-level analysis. The surface plausibility that makes adapted problems look correct is precisely what makes deeper failures easy to overlook.2026-06-09T15:50:12Z17 pages total with references and appendix, 9 figures, under reviewParisa SuchdevJuniper Lovatohttp://arxiv.org/abs/2606.10997v1A Companion App for an Autonomous Family Vehicle: Identification of Values for an Autonomous Mobility System2026-06-09T15:33:26ZIn this paper, we present a companion app for an autonomous vehicle aimed at user groups who would normally require an accompanying person to drive them. 
Two aspects of a companion app are presented in this paper: first, the possibility for a trusted person to track the ride of the person in need of support, and second, the possibility to put the vehicle settings for persons in need of support in the hands of a trusted person. In addition, this article describes the requirements and addressed values and discusses the safety-relevant aspects of such a companion app. We also discuss and identify the values that influence passengers and trusted persons using the companion app. Overall, a companion app can provide new perspectives and opportunities for people in need of support, allowing them to take advantage of the features offered by autonomous vehicles. It enables trusted individuals to configure the vehicle according to the passengers' needs. Such an app can also be a mechanism to involve trusted persons in the options given by the vehicle and give them the possibility to adapt the vehicle to the needs of the person in need of support.2026-06-09T15:33:26ZAccepted to be published in the 2026 IEEE Intelligent Vehicles Symposium (IV)Leon Johann BrettinTobias SchräderKerstin KuhlmannVanessa SchmidtMarkus Maurer