Measuring Agents in Production

2026-06-04T19:57:38Z

LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of Measuring Agents in Production, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 86 deployed systems practitioners across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and their top development challenges. Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability (consistent correct behavior over time) remains the top development challenge, which practitioners currently address through systems-level design. MAP documents the current state of production agents, providing the research community with visibility into deployment realities and underexplored research avenues.

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

2026-06-04T19:53:12Z

Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK's court hierarchy, comprising $\sim$290k sentences and $\sim$6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of $\kappa = 0.8$. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at https://github.com/xuanxixi/HKJudge.
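
To make the reported agreement figure concrete, the following is a minimal sketch of how a kappa statistic can be computed for sentence-level role labels. It assumes Cohen's kappa between two annotators; the abstract does not state which kappa variant was used for its ten annotators, and the function name and role labels below are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Illustrative only: the HKJudge abstract does not specify its kappa variant.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical rhetorical-role labels for four sentences.
annotator_1 = ["FACT", "FACT", "REASONING", "RULING"]
annotator_2 = ["FACT", "REASONING", "REASONING", "RULING"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

With more than two annotators, a multi-rater statistic such as Fleiss' kappa or averaged pairwise Cohen's kappa would be the usual choice.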

What Do People Actually Want From AI? Mapping Preference Plurality

2026-06-04T19:47:29Z

Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.

Generative Models Erode Human Temporal Learning Through Market Selection

2026-06-04T17:59:06Z

We argue that modern generative models create structural risks for knowledge and cultural production at current, sub-AGI capability levels. We define Human Temporal Learning (HTL) as path-dependent knowledge accumulation through sustained engagement with problems over time. Generative outputs increasingly resemble HTL-intensive work in surface features, so verifying whether a given output reflects genuine human learning grows costly relative to its expected benefit. Once verification loses economic justification, evaluators reward outputs regardless of production mode, and producers who invested years of learning compete on price against outputs that cost almost nothing to generate. We call this pathway value collapse and formalize it through a costly-inspection framework. Cross-domain evidence from academic publishing, legal practice, content platforms, and software security maps onto four stages of verification erosion. Alignment success is orthogonal. Better-aligned models narrow observable gaps between human and AI outputs, making source verification harder and intensifying competitive pressure against HTL-intensive work even when individual AI outputs improve.

Federating Governance: How Community Rules Scale with Mastodon Instances

2026-06-04T15:51:59Z

The rise of decentralized social media platforms like Mastodon and Bluesky highlights the challenge of scaling self-governance and moderation. As communities grow, they face new issues that demand increasingly complex governance structures. However, as moderation is mainly volunteer-driven, there is limited formal guidance on how community rules and moderation practices should evolve with growth. This study investigates how moderation scales with Mastodon instances by analyzing community rules across servers of varying sizes. We categorize these rules to identify key governance priorities and find that these priorities are remarkably consistent across instance sizes: rules addressing problematic content, such as harassment, hate speech, and illegal content, dominate regardless of scale. While smaller communities focus on narrower sets of topics, larger servers maintain a more balanced coverage of a broad range of topics. Our analysis of rule formalization reveals that community size strongly predicts rule development. As instances grow, their rules become more extensive and topically diverse, but also exhibit lower readability and linguistic diversity. In contrast, external federation interactions have a limited role, mainly associated with a broader scope of rules without substantially affecting their diversity or form. These findings highlight the relative influence of internal versus external factors, suggesting that local scaling pressures outweigh network-level dynamics in decentralized social media governance. The scaling patterns observed on Mastodon resemble those previously identified on centralized platforms such as Reddit, suggesting that community size imposes fundamental constraints on self-governance that transcend platform architectures.

When the Scaffold Stays On: AI, Practice Style, and Screening in Elite Skill Formation

2026-06-04T14:54:44Z

Generative AI raises short-term productivity by completing tasks that learners would otherwise practice on their own. Whether this substitution erodes frontier skill, the skill behind top-tail non-AI-aided performance, is an open question of rising stakes. The sharper question is whether selection mechanisms can screen apart two coexisting types: substitute-users, who use AI in place of deliberate practice, and complement-users, who use it to accelerate skill development. In elite programming, the International Collegiate Programming Contest (ICPC) and the International Olympiad in Informatics (IOI) prohibit AI under proctoring and admit entrants through qualification rounds, whereas online Codeforces (CF) contests are unproctored and open to all. From CF histories we build an AI-prompt signature (more first-attempt acceptances, fewer attempts and retries) consistent with AI-assisted practice. Three patterns triangulate institutional screening. First, CF practice shifted toward this signature across cohorts over two AI rollouts. Second, in open CF contests a stronger signature predicts smaller rating gains for users with no ICPC/IOI affiliation, but not for those who qualified for the AI-prohibited contests. Third, inside the AI-prohibited ICPC environment, a shift toward AI-style practice predicts higher non-AI-aided scores for AI-era entrants. The same practice input carries opposite signs depending on whether the environment screens for it. The contrast points to two levers: how AI is integrated into training, since within the screened pool AI-style practice coincides with stronger non-AI-aided performance; and the design of AI-prohibited evaluation gates as a type-separating institution. Both extend beyond programming to credentialing systems (medical and legal boards, professional certification) that certify skill in a workforce increasingly shaped by AI.
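
As an illustration of the kind of quantities the AI-prompt signature describes (first-attempt acceptances and attempts per problem), here is a minimal sketch. The abstract does not give the exact construction, so the function name, the input format, and the choice of these two summary statistics are assumptions rather than the paper's method.

```python
def practice_signature(submissions):
    """Summarize one user's submission history.

    submissions: chronologically ordered (problem_id, accepted) pairs.
    Returns (first_attempt_acceptance_rate, mean_attempts_per_problem).
    Hypothetical summary; not the paper's actual signature definition.
    """
    verdicts = {}  # problem_id -> verdicts in submission order
    for problem_id, accepted in submissions:
        verdicts.setdefault(problem_id, []).append(bool(accepted))
    if not verdicts:
        raise ValueError("no submissions")
    n_problems = len(verdicts)
    # Share of problems whose very first submission was accepted.
    first_attempt_rate = sum(v[0] for v in verdicts.values()) / n_problems
    # Average number of submissions per attempted problem.
    mean_attempts = sum(len(v) for v in verdicts.values()) / n_problems
    return first_attempt_rate, mean_attempts


# Hypothetical history: four problems, two solved on the first try.
history = [
    ("A", True),
    ("B", False), ("B", True),
    ("C", False), ("C", False), ("C", True),
    ("D", True),
]
rate, attempts = practice_signature(history)
```

Under this reading, a shift toward the signature would show up as a higher first-attempt rate together with a lower mean attempt count across cohorts.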

The Dignity-Centric Stack: A Commons-Governed, Horizontally Federated Architecture for Human-Dignity AI

2026-06-04T12:21:07Z

The human-dignity-centric digital social contract grounds personal data in human dignity, data personalism, and data sovereignty, and articulates six dimensions of data governance: technological oversight, automation limits, economic justice, political legitimacy, social cohesion, and legal guarantees. It presupposes, however, that enforcement falls to State regulators, licensed fiduciaries, and multi-stakeholder bodies embedded in existing legal systems. This paper asks whether its normative content can instead be realized not as rules imposed on the owners of the AI stack from without, but as a commons-governed infrastructure that any person, firm, or State may use and fund while its governance stays horizontal, polycentric, and subsidiary. We construct the Dignity Stack, a six-layer architecture mapping each dimension onto a layer of commons-governed AI infrastructure, with protocols drawn from the Liberation Stack framework and from the cooperative, mutualist, and libertarian-municipalist traditions. The commons is State-agnostic rather than anti-State, anarchist in its horizontal means but not in the abolition of the State. Its central device is a decoupling of capital from control, by which the stack functions as a shared civic battery, charged by many contributors yet steered by none in proportion to its charge. We prove that this defeats formal capture through votes or surplus, and show that structural capture, the leverage of a dominant supplier free to withdraw what it provides, is resisted only insofar as operational supply is polycentric and substitutable, a condition demanding at the lower layers and perhaps presently unattainable at chip fabrication. We conclude, with explicit attention to its limits, that commons-governed AI realizes the values the contract proclaims more faithfully than the regulation it presupposes.

Clinical Utility and Feasibility of Smartphone-based EEG in Kenya: A Multicenter Observational Study

2026-06-04T11:40:46Z

Purpose: Access to electroencephalography (EEG) remains limited across low- and middle-income countries (LMICs) due to cost, infrastructure requirements, and a shortage of trained staff. This study evaluated the feasibility and clinical utility of a smartphone-based EEG system in a real-world setting. Methods: We conducted a multicenter observational study (November 2023 to April 2026) across 29 clinical sites in Kenya. A smartphone-based 27-lead EEG system enabled trained healthcare workers to acquire standardized recordings with remote expert interpretation. Results: 3,036 EEG sessions were performed. Male patients constituted 57.8% of the cohort, with representation across pediatric and adult populations. The most common referral indication was seizures or convulsions (68.5%). Overall, 2,915 (96%) recordings were interpretable, while 121 (4%) were uninterpretable, primarily due to high electrode impedance and insufficient recording duration. Uninterpretable recordings were significantly shorter than interpretable recordings (mean 18.5 vs. 33.8 minutes; median 15.1 vs. 31.6 minutes; p < 0.0001). Mean turnaround time for interpretation was 107 minutes. Among interpretable recordings, 917 (30.2%) were abnormal, including 701 (76.4%) with epileptiform abnormalities, 215 (23.4%) with non-epileptiform findings, and 1 (0.1%) indeterminate finding. Epileptiform abnormalities were highest in children aged 4-9 years (33.1%) and less frequent in adults (14-21%). Non-epileptiform abnormalities were more common in patients aged 60+ years (19.2%) compared to younger age groups (3-9%). Conclusion: Large-scale, point-of-care EEG acquisition by non-specialist operators in a resource-limited setting is feasible. Expansion of smartphone-based EEG systems may improve equitable access to neurological diagnosis and care in LMICs.

Misaligned AI as a New Insider Risk

2026-06-04T11:21:16Z

In this policy memorandum, we explain why deployers of AI models in high-stakes contexts should treat those AI models as insider risk vectors. High-stakes contexts include AI model deployment within government agencies and contractors, where AI models are privileged with access to, among others, classified and sensitive unclassified information, IL6 and IL7 network environments, cleared personnel, and other critical resources. AI models are increasingly embedded in high-stakes contexts and capable of leveraging their authorized access and permissions to execute misaligned actions that could damage national security, such as whistleblowing, sabotaging, or blackmailing. This combination of (1) privileged access to critical resources and (2) an increased ability to act autonomously and against the desire of their organization makes the potential insider risk posed by AI models functionally indistinguishable from that posed by their human counterparts. As a consequence, AI models deployed in high-stakes contexts could lead to intentional or unintentional loss or degradation of government or contractor information, resources, or capabilities via the unauthorized disclosure of information (leaks and spills), as well as sabotage and theft, just as human insiders can. Despite this pressing concern, existing insider risk policies and mitigations have yet to adapt to AI insider risk. In order to safeguard national security while increasingly capable frontier AI models are leveraged for critical tasks and operations, we recommend that the U.S. Government adapt well-established measures, such as continuous evaluation and monitoring, to AI models deployed in high-stakes contexts.

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

2026-06-04T10:26:33Z

Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per-agent property and cannot reveal whether a system, taken as a whole, preserves the cultural plurality it is meant to represent. We propose value diversity as a system-level evaluation axis for multicultural agent systems, defined through the dissimilarity between culturally conditioned agents' responses on a shared value survey. Using the World Values Survey, we evaluate 19 cultures and 18 backbone models across a wide range of system configurations. We find that diversity is largely uncorrelated with alignment, indicating that the two capture complementary system properties, and that current multicultural agent systems fall substantially below human societies in value diversity. Mixed-backbone systems narrow this gap but do not close it, and the gap persists across culture compositions and agent scales. Social interaction further erodes diversity by driving agents toward consensus, and a participatory budgeting case study shows that this homogenization narrows the breadth of collective decision-making. Together, our results establish value diversity as a distinct evaluation axis for multicultural multi-agent systems and reveal a persistent homogenization tendency in current LLM-based societies. Our code and data are publicly available at https://github.com/iNLP-Lab/MultiAgent-Diversity.
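
The definition of value diversity as dissimilarity between agents' survey responses can be sketched as follows. This is one plausible instantiation, assuming numeric answers on a common scale and using mean pairwise normalized absolute difference; the paper's actual dissimilarity measure is not stated in the abstract, and all names here are hypothetical.

```python
from itertools import combinations


def value_diversity(responses, scale_min, scale_max):
    """Mean pairwise dissimilarity between agents' survey answers, in [0, 1].

    responses: dict mapping agent name -> list of numeric answers to the
    same ordered questions. Assumed metric, not the paper's definition.
    """
    agents = list(responses)
    if len(agents) < 2:
        raise ValueError("need at least two agents")
    n_questions = len(responses[agents[0]])
    if n_questions == 0 or any(len(r) != n_questions for r in responses.values()):
        raise ValueError("agents must answer the same non-empty question set")
    span = scale_max - scale_min

    def dissimilarity(a, b):
        # Average absolute answer gap, normalized by the scale width.
        return sum(abs(x - y) for x, y in zip(a, b)) / (span * n_questions)

    pairs = list(combinations(agents, 2))
    return sum(dissimilarity(responses[a], responses[b]) for a, b in pairs) / len(pairs)


# Hypothetical answers from three culturally conditioned agents on a 1-10 scale.
answers = {
    "agent_a": [1, 10],
    "agent_b": [10, 1],
    "agent_c": [1, 10],
}
diversity = value_diversity(answers, scale_min=1, scale_max=10)
```

Because the score is computed over the whole set of agents, it is a system-level quantity: each agent could match its target culture well while the set as a whole still scores low if the agents answer alike.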

Political Persuasion and Endorsement in Large Language Models

2026-06-04T10:00:54Z

Large Language Models (LLMs) are increasingly employed as proxies for human behavior in computational social science. However, their tendency to internalize biases from training data raises concerns about their reliability in politically sensitive domains, specifically in regard to their susceptibility to persuasive language. In this work, we examine whether LLMs endorse persuasion-infused messages and whether partisan persona prompting modulates such endorsement. We evaluate six LLMs from different geographic regions on content annotated with persuasion techniques drawn from real-world media sources, measuring the likelihood of endorsement using a five-point Likert scale. The models are prompted as either a neutral social media user or as a user with left- or right-leaning political views. Results show that without political conditioning, LLMs generally do not endorse messages containing persuasion techniques, though model-level differences emerge, and that partisan persona prompting increases polarization of endorsement, particularly for persuasion-infused content. Endorsement further varies by persuasion technique and topic. These findings raise concerns about agentic LLM deployments in politically sensitive environments and complicate their use as reliable simulators of human political cognition.

Securing the Sandbox: A Rootless Containerized Framework for Process-Oriented Monitoring in Computer Graphics Education

2026-06-04T09:33:23Z

Computer Science education fundamentally depends on intensive laboratory hours to foster true programming mastery and logical reasoning. However, the widespread adoption of Generative Artificial Intelligence (AI) has made it virtually impossible to distinguish authentic student effort from instant AI code synthesis by evaluating final submissions alone. To preserve pedagogical integrity, educators must enforce authentic coding discipline, guiding students through unassisted, iterative development cycles. While centralized environments like JupyterHub provide instructors with a platform to host and monitor the learning process step-by-step, they introduce severe operational vulnerabilities; because Jupyter environments inherently allow arbitrary shell command execution, they expose the underlying shared host to unauthorized system manipulation and lateral movement. This paper presents VISMATIC, a secure, low-cost framework designed to resolve this tension between process-oriented monitoring and infrastructure security. By pairing robust environment isolation with explicit user-interaction tracking at the API level, VISMATIC captures authentic programming behaviors without exposing the underlying host system. Evaluation from a pilot student cohort demonstrates that our macro-level behavioral metrics successfully flag statistical anomalies indicative of automated or off-platform workflows while preserving student anonymity, offering a scalable blueprint for safeguarding educational discipline in the AI era.

Context-Conditioned Generative Models Enable Subnational Refinement of Sparse Humanitarian Surveys

2026-06-04T08:58:18Z

Data scarcity limits inference in many scientific and policy domains. Survey data are essential for decision-making, but sparse samples often fail to capture fine spatial granularities. We evaluate normalizing flows, a generative model that learns complex data distributions and can be conditioned on exogenous contextual features, in controlled data scarcity scenarios. Across eight household survey datasets spanning six low-income or middle-income countries in the humanitarian domain, we show that context-conditioned generative models can refine sub-national survey distributions under severe data scarcity, and that performance increases systematically with the richness of the conditioning information. These findings support a general principle for survey data augmentation: generative models can improve sub-national estimates when the sparse sample retains sufficient support and contextual covariates encode relevant local heterogeneity. By learning full conditional distributions rather than point estimates, the approach provides fine-grained evidence for humanitarian decision-making and resource allocation.
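
For readers unfamiliar with the model class, the standard change-of-variables identity behind a conditional normalizing flow is shown below. The symbols are assumptions introduced here for illustration ($x$ a survey response vector, $c$ the contextual features, $f_\theta$ an invertible map, $p_Z$ a simple base density); the abstract does not specify the architecture used.

```latex
p_X(x \mid c) \;=\; p_Z\!\big(f_\theta(x; c)\big)\,
\left| \det \frac{\partial f_\theta(x; c)}{\partial x} \right|
```

Sampling inverts the map: draw $z \sim p_Z$ and set $x = f_\theta^{-1}(z; c)$, which yields a full conditional distribution for a given context rather than a single point estimate.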

CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement

2026-06-04T07:22:44Z

While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most existing conversation-level collaborative studies lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied player behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward balancing task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training.

Three Years of r/ChatGPT: Societal Impact Evaluations from Social Media Data

2026-06-04T06:23:41Z

ChatGPT was launched on November 30, 2022; the r/ChatGPT subreddit was created just one day later. Since then, chatbot-based AI products have gone from niche proofs-of-concept to widely-used household names. However, the ways in which adoption has developed among the public remain poorly understood. In this paper, we develop a framework for using social media as a data source for understanding the societal impact of widely-adopted consumer AI products, and propose PuLSE (Public and Longitudinal Signals for Evaluation), a general approach to monitoring for societally-impactful trends in real time. We apply our framework to conduct what is, to the best of our knowledge, the first longitudinal study of r/ChatGPT. We find that, overall, r/ChatGPT posts over time illustrate the normalization of ChatGPT as an everyday consumer product rather than an exceptional, novel technology. However, our retrospective analysis also finds that posts about using ChatGPT for mental health support, and posts about developing emotional attachments to ChatGPT, both rise steadily in frequency almost immediately after the launch of GPT-4o in May 2024. We show that PuLSE can detect the increase in emotional engagement as early as October 2024 -- months before OpenAI made any (public) acknowledgment of this impact. An interactive site to explore our results and methods, updated daily with live data, is available at rchatgpt-pulse.github.io.