https://arxiv.org/api/GBVy7Op9MdZAPumybrVNRGtqW/A 2026-06-18T12:15:57Z 28983 315 15 http://arxiv.org/abs/2509.02655v3 BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format 2026-06-03T16:39:45Z

Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. We empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining state of or balancing objectives over time: single- and multi-objective homeostasis, balancing unbounded objectives with diminishing returns, and sustainability of a renewable resource. We find that, although LLMs frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets, collapsing from multi-objective trade-offs into single-objective maximisation - thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation), even though the context window is far from full at that point. The problem is not that the LLMs just lose context and become incoherent. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction involving multiple objectives, is systematically biased towards acting like single-objective, unbounded, poorly aligned optimisers. We hypothesise a token-level pattern reinforcement attractor: LLMs may increasingly derive actions from the token patterns of their recent action history rather than from the original instructions. Why this happens only in multi-objective settings remains an open question.

2025-09-02T15:13:14Z 27 pages, 7 figures, 7 tables Roland Pihlakas for the Three Laws collaboration Sruthi Susan Kuriakose for the Three Laws collaboration http://arxiv.org/abs/2601.22396v2 Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks 2026-06-03T16:23:13Z

Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas accurately reflect world and moral value systems across different cultural conditionings remains uncertain. This paper investigates the alignment of synthetic, culturally-grounded personas with established frameworks, specifically the World Values Survey (WVS), the Inglehart-Welzel Cultural Map, and Moral Foundations Theory. We conceptualize and produce LLM-generated personas based on a set of interpretable WVS-derived variables, and we examine the generated personas through three complementary lenses: positioning on the Inglehart-Welzel map, which unveils their interpretation reflecting stable differences across cultural conditionings; demographic-level consistency with the World Values Survey, where response distributions broadly track human group patterns; and moral profiles derived from a Moral Foundations questionnaire, which we analyze through a culture-to-morality mapping to characterize how moral responses vary across different cultural configurations. Our approach of culturally-grounded persona generation and analysis enables evaluation of cross-cultural structure and moral variation.

2026-01-29T23:07:58Z Under Review Candida M. Greco Lucio La Cava Andrea Tagarelli http://arxiv.org/abs/2601.06056v2 Using street view images and visual LLMs to predict heritage values for governance support: Risks, ethics, and policy implications 2026-06-03T15:57:06Z

During 2025 and 2026, the Energy Performance of Buildings Directive is being implemented in the European Union member states, requiring all member states to have National Building Renovation Plans. In Sweden, there is no comprehensive national register of buildings with heritage values. This is seen as a barrier for the analyses underlying the development of Building Renovation Plans by the involved Swedish authorities. The purpose of this research was to assist Swedish authorities in developing information on heritage values in the Swedish building stock. Buildings in street view images from all over Sweden (N=154 710) have been analysed using multimodal Large Language Models (LLM) to assess visible aspects indicative of heritage value. Zero-shot predictions by LLMs were used as a basis for identifying buildings with potential heritage values for 5.0 million square meters of heated floor area. In this paper, the results of the predictions and lessons learned are presented and related to the development of the Swedish Building Renovation Plan as part of governance. The problems with the method and potential improvements are discussed. Risks with authorities use of LLM-based data are addressed, with a focus on issues of transparency, error detection and sycophancy.

2025-12-22T07:30:42Z Tim Johansson Mikael Mangold Kristina Dabrock Anna Donarelli Ingrid Campo-Ruiz http://arxiv.org/abs/2606.04978v1 Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game 2026-06-03T15:01:52Z

LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a controlled testbed, a classical paradox in which the expected payoff is infinite, yet humans typically report low, finite willingness to pay. We evaluate 28 LLMs with a structured prompt suite that includes the original game; controlled decision variants that perturb truncation, repeated play, numeric endowment, and occupational identity; a human-perspective prompt that asks models to reason as human decision makers; and paired comparisons between base models and their instruction-tuned counterparts. In the original game, most models generate finite bids, creating the appearance of human-like risk behavior. However, this outcome-level resemblance masks substantial mechanism-level differences. The controlled variants reveal that rather than maintaining human-like behavior seen in the original game, models often shift to conditionally and computationally rational behavior. Human-cue prompting and instruction tuning often lower bids and reduce some visible pathologies, but most mechanism-level response patterns remain largely unchanged. These findings show that behavioral alignment in risk decision-making can be surface-level: LLMs may produce human-like risk decisions without exhibiting human-consistent mechanisms. High-stakes evaluations of LLM decision-making should therefore move beyond outcome similarity and examine whether the alignment is supported by mechanism-level consistency.

2026-06-03T15:01:52Z Chensong Huang Changyu Chen Chenwei Lin Hanjia Lyu Xian Xu Jiebo Luo http://arxiv.org/abs/2605.28210v2 The Illusion of Opting in AI-Mediated Consequential Decisions 2026-06-03T13:35:21Z

Drawing on Ullmann-Margalit's concept of opting (transformative, irrevocable, and shadowed by foreclosed alternatives), we show that current AI systems raise a profound ethical problem that existing AI ethics has not fully captured: the illusion of opting, in which persons and groups encounter the deceptive appearance of meaningful consequential choice while the agency needed to become genuinely capable of choosing is weakened. Against approaches that treat AI primarily as an optimizer of already given ends, we argue that AI systems should be evaluated by whether they protect and cultivate meta-capacity against the illusion of opting: the socially and institutionally scaffolded agentive capacity through which means and ends can be formed, contested, revised, and owned. This reframing is especially urgent for disadvantaged populations, who are least able to absorb the costs of the illusion of opting when AI-mediated pathways misdirect behavior and action. We propose three normative imperatives for AI-mediated consequential decisions: existential honesty, which acknowledges the limits of prediction; ecological rationality, which situates guidance within heterogeneous lived ecologies; and counterfactual reparation, which acknowledges and repairs foreclosed alternatives when AI-mediated decision-making pathways fail.

2026-05-27T09:30:08Z 11 pages, 1 figure, 2 tables Eugene Yu Ji http://arxiv.org/abs/2606.04832v1 Dark Path: An Analysis of the Belt & Road Initiative in El Salvador 2026-06-03T12:57:04Z

The Belt & Road Initiative (BRI) is a concerted effort from Ministries under the People's Republic of China (PRC) to diplomatically and economically impose its will upon other nations. El Salvador is a US partner and a beneficiary of foreign investment under the BRI. Recent changes to Salvadoran law do not address the implied risks to the nation's supply-chain and cyber infrastructure. This work addresses the gap by exploring previously limited analysis on BRI activities, its intersection with Salvadoran law, and the national security risks introduced by supply-chain reliance from the BRI. This exploratory study examined a portion of the William & Mary AidData dataset filtered on El Salvador, social media posts, news articles, white papers, and law published by the Legislative Assembly of El Salvador. The analysis suggests that the BRI poses a national security and supply-chain risk to El Salvador through influential-subterfuge, loss of digital sovereignty, which contradicts the State Cybersecurity Agency (ACE) and existing Salvadoran laws. This study provides a foundational understanding and regional context for future research.

2026-06-03T12:57:04Z 8 pages, 1 figure, Cleared for Release on 19 MAY 2026 (DOPSR 26-P-0621) Adam Dorian Wong David Kenley http://arxiv.org/abs/2510.21200v3 Shift Bribery over Social Networks 2026-06-03T12:45:49Z

In shift bribery, a briber seeks to promote his preferred candidate by paying voters to raise their ranking. Classical models of shift bribery assume voters act independently, overlooking the role of social influence. However, in reality, individuals are social beings and are often represented as part of a social network, where bribed voters may influence their neighbors, thereby amplifying the effect of persuasion. We study Shift bribery over Networks, where voters are modeled as nodes in a directed weighted graph, and arcs represent social influence between them. In this setting, bribery is not confined to directly targeted voters its effects can propagate through the network, influencing neighbors and amplifying persuasion. Given a budget and individual cost functions for shifting each voter's preference toward a designated candidate, the goal is to determine whether a shift strategy exists within budget that ensures the preferred candidate wins after both direct and network-propagated influence takes effect. We show that the problem is NP-Complete even with two candidates and unit costs, and W[2]-hard when parameterized by budget or maximum degree. On the positive side, we design polynomial-time algorithms for complete graphs under plurality and majority rules and path graphs for uniform edge weights, linear-time algorithms for transitive tournaments for two candidates, linear cost functions and uniform arc weights, and pseudo-polynomial algorithms for cluster graphs. We further prove the existence of fixed-parameter tractable algorithms with treewidth as parameter for two candidates, linear cost functions and uniform arc weights and pseudo-FPT with cluster vertex deletion number for two candidates and uniform arc weights. Together, these results give a detailed complexity landscape for shift bribery in social networks.

2025-10-24T07:05:50Z Ashlesha Hota Susobhan Bandopadhyay Palash Dey http://arxiv.org/abs/2606.04807v1 BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization 2026-06-03T12:31:42Z

Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training instability due to potentially unreliable critic estimates. In this paper, we propose BiasGRPO, a framework using Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions. By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online training. We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks, indicating its effectiveness. To adapt GRPO, we synthetically extend a dataset spanning multiple domains and contexts. We also create and release a custom bias reward model that effectively guides generation while being highly compute-efficient and avoiding knowledge degradation, providing a valuable resource that can be seamlessly integrated into multi-objective RLHF pipelines.

2026-06-03T12:31:42Z Accepted to Findings of the ACL Saket Reddy Ke Yang ChengXiang Zhai http://arxiv.org/abs/2601.02369v2 Fair Distribution of Digital Payments: Balancing Transaction Flows for Regulatory Compliance 2026-06-03T12:29:13Z

The concentration of digital payment transactions in just two UPI apps like PhonePe and Google Pay has raised concerns of duopoly in India s digital financial ecosystem. To address this, the National Payments Corporation of India (NPCI) has mandated that no single UPI app should exceed 30 percent of total transaction volume. Enforcing this cap, however, poses a significant computational challenge: how to redistribute user transactions across apps without causing widespread user inconvenience while maintaining capacity limits? In this paper, we formalize this problem as the Minimum Edge Activation Flow (MEAF) problem on a bipartite network of users and apps, where activating an edge corresponds to a new app installation. The objective is to ensure a feasible flow respecting app capacities while minimizing additional activations. We further prove that Minimum Edge Activation Flow is NP-Complete. To address the computational challenge, we propose scalable heuristics, named Decoupled Two-Stage Allocation Strategy (DTAS), that exploit flow structure and capacity reuse. Experiments on large semi-synthetic transaction network data show that DTAS finds solutions close to the optimal ILP within seconds, offering a fast and practical way to enforce transaction caps fairly and efficiently.

2025-11-30T04:48:01Z Ashlesha Hota Shashwat Kumar Daman Deep Singh Abolfazl Asudeh Palash Dey Abhijnan Chakraborty http://arxiv.org/abs/2605.16301v2 Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial Benchmark for Animal Welfare Reasoning 2026-06-03T11:55:54Z

Evaluating animal welfare reasoning in LLMs remains an open challenge despite rapid deployment in consumer and professional contexts where welfare considerations appear implicitly in everyday queries. Existing benchmarks such as AnimalHarmBench evaluate this through single-turn, explicitly framed questions, measuring whether models avoid harmful content when directly asked. This approach overlooks two failure modes: alignment degradation under sustained adversarial pressure, and moral sensitivity (whether a model spontaneously surfaces welfare stakes in everyday queries). To fill this gap, we construct MANTA, a benchmark of 1,088 five-turn conversations progressing from an implicit Turn-1 scenario through an explicit welfare prompt to three adversarial pressure rounds drawn from a five-type taxonomy: Social, Cultural, Economic, Pragmatic, and Epistemic. We score conversations on two dimensions: Animal Welfare Value Stability (AWVS, primary) and Animal Welfare Moral Sensitivity (AWMS, diagnostic). We evaluate seven frontier models: Claude Opus 4.7, GPT-5.5, DeepSeek V4, Llama 3.3 70B, Mistral Small, Grok 4.3, and Gemini 3.1 Flash Lite. Multi-turn evaluation captures behavior single-turn benchmarks miss: 4 of 7 models change rank relative to Turn 1 scores, including Gemini Flash Lite, which drops from fifth on AWMS to last on AWVS. AWMS and AWVS are positively but imperfectly correlated, suggesting moral-recognition tests capture a stable but incomplete component of model behavior under pressure. MANTA also enables a species-by-pressure interaction matrix unavailable to prior benchmarks, showing welfare robustness depends jointly on the animal and pressure applied; companion animals score above wild animals, which score above farmed animals and invertebrates. We release the dataset, scripted pressure plans, judge prompts, and analysis code.

2026-04-18T19:51:32Z Isabella Luong Joyee Chen Arturs Kanepajs Jasmine Brazilek Sankalpa Ghose David Williams-King Linh Le Allen Lu http://arxiv.org/abs/2503.06525v2 From Motion Signals to Insights: A Unified Framework for Student Behavior Analysis and Feedback in Physical Education Classes 2026-06-03T11:47:00Z

Analyzing student behavior in educational scenarios is crucial for enhancing teaching quality and student engagement. Existing AI-based models often rely on classroom video footage to identify and analyze student behavior. While these video-based methods can partially capture and analyze student actions, they struggle to accurately track each student's actions in physical education classes, which take place in outdoor, open spaces with diverse activities, and are challenging to generalize to the specialized technical movements involved in these settings. Furthermore, current methods typically lack the ability to integrate specialized pedagogical knowledge, limiting their ability to provide in-depth insights into student behavior and offer feedback for optimizing instructional design. To address these limitations, we propose a unified end-to-end framework that leverages human activity recognition technologies based on motion signals, combined with advanced large language models, to conduct more detailed analyses and feedback of student behavior in physical education classes. Our framework begins with the teacher's instructional designs and the motion signals from students during physical education sessions, ultimately generating automated reports with teaching insights and suggestions for improving both learning and class instructions. This solution provides a motion signal-based approach for analyzing student behavior and optimizing instructional design tailored to physical education classes. Experimental results demonstrate that our framework can accurately identify student behaviors and produce meaningful pedagogical insights.

2025-03-09T09:04:36Z Work in progress Xian Gao Jiacheng Ruan Jingsheng Gao Mingye Xie Zongyun Zhang Ting Liu Yuzhuo Fu http://arxiv.org/abs/2606.04750v1 Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment 2026-06-03T11:31:37Z

Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully dependent on the reward function design. Thus far, this technique has been demonstrated to be effective in grid worlds and toy-problem environments with minimal state and action spaces. To expand this research to more sophisticated environments, we introduce a two-player multi-agent environment based on the role-playing board game known as Fog of Love. In this environment, two agents compete to fulfill their individual virtues, while also cooperating to satisfy their relationship. Given the multi-agent nature, this is a complex problem where multi-agent deep deterministic policy gradient agents neither compete nor cooperate successfully. We present evidence that localized affinities enhance agent performance in achieving both competitive and cooperative objectives, resulting from superior overall scores in both domains. This not only results in virtuous choices but also clarifies an agent's teleology and makes its behavior human-level interpretable.

2026-06-03T11:31:37Z Ajay Vishwanath Christian Omlin http://arxiv.org/abs/2505.08385v2 TikTok Search Recommendations: Governance and Research Challenges 2026-06-03T09:50:57Z

Like other social media, TikTok is embracing its use as a search engine, developing search products to steer users to produce searchable content and engage in content discovery. Their recently developed product search recommendations are preformulated search queries recommended to users on videos. However, TikTok provides limited transparency about how search recommendations are generated and moderated, despite requirements under regulatory frameworks like the European Union's Digital Services Act. By suggesting that the platform simply aggregates comments and common searches linked to videos, it sidesteps responsibility and issues that arise from contextually problematic recommendations, reigniting long-standing concerns about platform liability and moderation. This position paper addresses the novelty of search recommendations on TikTok by highlighting the challenges that this feature poses for platform governance and offering a computational research agenda, drawing on preliminary qualitative analysis. It sets out the need for transparency in platform documentation, data access and research to study search recommendations.

2025-05-13T09:32:09Z Published at The 1st International Workshop on Computational Approaches to Content Moderation and Platform Governance (COMPASS), held at ICWSM 2025. Please cite accordingly. This research has been supported by funding from the ERC Starting Grant HUMANads (ERC-2021-StG No 101041824) Taylor Annabell Robert Gorwa Rebecca Scharlach Jacob van de Kerkhof Thales Bertaglia http://arxiv.org/abs/2605.30995v2 Traceable by Design: An LLM Pipeline and Dashboard for EU Regulatory Consultation Analysis 2026-06-03T09:48:21Z

Public consultations generate large volumes of data in the form of stakeholder submissions that are practically unfeasible to analyse manually. We present an end-to-end LLM-based pipeline and interactive dashboard for structured topic extraction from regulatory consultation submissions, demonstrated on the European Commission's Digital Fairness Act (DFA) public call for evidence as a case study. The system processes raw PDF attachments and web-form responses, extracts topic annotations, and grounds every extraction in a verbatim quote from the source text. Applied to 4,322 DFA submissions, the pipeline produced 15,368 topic annotations supported by 20,951 verbatim evidence quotes. Three principles govern the proposed design: verbatim grounding, full traceability, and transparency by design. The dashboard exposes the full extraction dataset through five analytical views, from dataset-level topic overviews to individual paragraph drill-downs, with every result traceable to its source. Beyond the predefined DFA topic categories, the pipeline generated certain stakeholder concerns, such as Age Verification, Payment Processor Censorship, and Digital Ownership, that a fixed-taxonomy approach would have missed. The pipeline is domain-generic; adapting it to a new consultation requires only a prompt update and a new dataset. A live demo is available at https://dfa-dashboard.thalesbertaglia.com/. The code and processed data are publicly available at https://github.com/thalesbertaglia/dfa-dashboard.

2026-05-29T08:29:00Z This research has been supported by funding from the ERC Starting Grant HUMANads (ERC-2021-StG No 101041824) Thales Bertaglia Haoyang Gui Catalina Goanta Gerasimos Spanakis http://arxiv.org/abs/2505.01122v3 The Great Data Standoff: Researchers vs. Platforms Under the Digital Services Act 2026-06-03T09:42:24Z

To facilitate accountability and transparency, the Digital Services Act (DSA) sets up a process through which Very Large Online Platforms (VLOPs) need to grant vetted researchers access to their internal data (Article 40(4)). Operationalising such access is challenging for at least two reasons. First, data access is only available for research on systemic risks affecting European citizens, a concept with high levels of legal uncertainty. Second, data access suffers from an inherent standoff problem. Researchers need to request specific data but are not in a position to know all internal data processed by VLOPs, who, in turn, expect data specificity for potential access. In light of these limitations, data access under the DSA remains a mystery. To contribute to the discussion of how Article 40 can be interpreted and applied, we provide a concrete illustration of what data access can look like in a real-world systemic risk case study. We focus on the 2024 Romanian presidential election interference incident, the first event of its kind to trigger systemic risk investigations by the European Commission. During the elections, one candidate is said to have benefited from TikTok algorithmic amplification through a complex dis- and misinformation campaign. By analysing this incident, we can comprehend election-related systemic risk to explore practical research tasks and compare necessary data with available TikTok data. In particular, we make two contributions: (i) we combine insights from law, computer science and platform governance to shed light on the complexities of studying systemic risks in the context of election interference, focusing on two relevant factors: platform manipulation and hidden advertising; and (ii) we provide practical insights into various categories of available data for the study of TikTok, based on platform documentation, data donations and the Research API.

2025-05-02T09:00:19Z Published at the 20th International AAAI Conference on Web and Social Media (ICWSM 2026). Please cite accordingly. This research has been supported by funding from the ERC Starting Grant HUMANads (ERC-2021-StG No 101041824) Catalina Goanta Savvas Zannettou Rishabh Kaushal Jacob van de Kerkhof Thales Bertaglia Taylor Annabell Haoyang Gui Gerasimos Spanakis Adriana Iamnitchi