https://arxiv.org/api/9AI6sfYzrzZEZglV2b1yCJZWGcs 2026-06-13T20:44:04Z 28886 150 15 http://arxiv.org/abs/2606.06851v1 Toward a Metaphysics of Learning Analytics: Ontological Positioning of Data, Inference, and Normativity 2026-06-05T02:51:14Z The Learning Analytics (LA) community has undergone rapid development over the 15 years since the first LAK conference was held. However, while epistemological and ethical debates regarding the philosophical foundations of LA have been vigorous, metaphysical discussions have been sparse, signifying a lack of effort to derive the identity of LA from its internal principles. In this paper, we attempt to establish a metaphysics of LA by addressing the ontological question of ``What is LA?'' We do so by tracing back to LA's own definitions and principles to derive an answer from within LA itself. Specifically, we address what kind of existence the data LA operates on constitutes, identify eight agents including learners as ontological prerequisites, and clarify, via the is/ought problem, that LA does not derive norms from data. In particular, this system reveals that a class of LA practices, here termed \textit{norm-embedded LA}, conflates LA's purpose with its operations, creating an ontological tension with the first principle. We also discuss connections with related fields and the limitations of this system. The metaphysics outlined here is not imposed from outside LA, but surfaces what LA itself has always implicitly presupposed. 2026-06-05T02:51:14Z 25 pages, 1 figure Kensuke Takii http://arxiv.org/abs/2606.06830v1 Learning Fair Demand Models 2026-06-05T02:12:07Z Data-driven pricing is increasingly prevalent in sectors such as airlines, lending, insurance, and retail. By learning demand models from customer features and setting prices accordingly, these systems may generate discriminatory outcomes that raise fairness concerns. 
This leads to fundamental questions - how and where should systems incorporate fairness considerations in the pricing pipeline, and how does it ultimately affect societal outcomes? To answer these, we study a stylized model where a seller has a two-stage decision pipeline comprising linear demand model estimation followed by price optimization. The seller considers fairness notions in training loss, price, and demand, under both parity-wise and Rawlsian perspectives. We show that equalizing training loss across consumer groups leads to multiple solutions, which in turn can result in undesirable outcomes despite being a standard approach in fair machine learning. Focusing instead on fairness applied directly to prices or demand, we compare two strategies that enforce fairness in either the demand estimation stage or the price optimization stage. For parity-wise fairness, we characterize when each strategy yields higher social welfare under small fairness levels. We show that when market sizes and prices in the dataset are similar, imposing price fairness in the estimation stage is more beneficial to consumers, whereas imposing demand fairness in the optimization stage yields better consumer outcomes. For Rawlsian fairness, the two strategies coincide exactly. Lastly, we extend our model to alternate demand functions and conduct a case study using real-world vaccine pricing data. 2026-06-05T02:12:07Z Adam N. Elmachtoub Hyemi Kim Jonathan Y. Tan http://arxiv.org/abs/2606.06784v1 What Your Posts Reveal: A Benchmark and Agentic Framework for User-Level Privacy Leakage on Social Media 2026-06-05T00:02:47Z Public social media posts can reveal private information through weak cues scattered across text, images, or metadata. Such leakage is often cumulative and cross-post: cues that appear harmless in isolation may jointly expose a user's home, workplace, or routine. 
However, current research lacks a unified benchmark for user-level multimodal privacy leakage and an evaluation metric that captures exposure severity beyond binary accuracy. To address these gaps, we propose SopriBench, a synthetic benchmark guided by leakage patterns abstracted from a private reference corpus of Rednote and Instagram accounts, covering 50 user profiles and 1,569 images with attributes, contextual sensitivity, granularity, leakage type, inference difficulty, and supporting evidence. We further introduce the Privacy Exposure Score (PES), which weights value granularity by contextual sensitivity. Inspired by abductive reasoning, we introduce Argus, a training-free agentic framework for cumulative leakage inference. Argus forms hypotheses from accumulated evidence, verifies supporting evidence, and aggregates cross-post cues into privacy profiles, achieving 0.55 PES, a 25% improvement over the strongest baseline, with the largest gain on cross-post leakage. 2026-06-05T00:02:47Z Zifan Peng Yini Huang Aiwen Lu Qiming Ye Peixian Zhang Jingyi Zheng Yule Liu Xuechao Wang Xinlei He Jiaheng Wei http://arxiv.org/abs/2606.06694v1 The Geography of Algorithmic Judgment: LLM Intermediaries, Place Identity, and Racial Steering in Housing Search 2026-06-04T20:17:58Z Large language models (LLMs) are rapidly assuming an intermediary role in housing search through the integration of listing platforms within conversational interfaces, mediating access to information, search, and recommendations within urban settings. We expand on prior work on racial steering in LLMs by conducting a behavioral audit of seven open-weight and closed-source LLMs across four U.S. cities, testing location recommendations across three iterative prompting conditions that progressively add lifestyle preference context and reflect fair housing paired-testing methodologies. We find that steering is an emergent behavior of the model's interpretive license rather than primarily a static property. 
Steering results from the interaction of a user's identity, preference articulation, and the spatial logic that a model has internalized about learned representations of place, preference, and opportunity in a given city, and how different types of users relate to it. While steering was present, it was not uniform in direction or magnitude across evaluated conditions. Preference-conditioned testing often increased or reconfigured the number of models that exhibited steering behaviors relative to baseline conditions, suggesting that LLMs may interpret what the same housing preference means differently depending on the racial identity of the user. Our findings also demonstrate that the city is not a neutral testing unit for LLM evaluation in place-based sectors, and results from one local market cannot be assumed to generalize to another. Local and domain expertise will be required in the housing sector to ensure that legal and institutional commitments to fair housing are not undermined while adopting AI tools that mediate spatial access. 2026-06-04T20:17:58Z 13 pages with supplemental tables and figures, AIES '26 Submission Hana Samad Trung Lam Christoph Mügge-Durum Michael Akinwumi http://arxiv.org/abs/2512.04123v4 Measuring Agents in Production 2026-06-04T19:57:38Z LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of Measuring Agents in Production, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 86 deployed systems practitioners across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and their top development challenges. 
Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability (consistent correct behavior over time) remains the top development challenge, which practitioners currently address through systems-level design. MAP documents the current state of production agents, providing the research community with visibility into deployment realities and underexplored research avenues. 2025-12-02T16:45:10Z Accepted to the 43rd International Conference on Machine Learning (ICML 2026) as Oral Presentation Melissa Z. Pan Negar Arabzadeh Riccardo Cogo Yuxuan Zhu Alexander Xiong Lakshya A Agrawal Huanzhi Mao Emma Shen Sid Pallerla Liana Patel Shu Liu Tianneng Shi Xiaoyuan Liu Jared Quincy Davis Emmanuele Lacavalla Alessandro Basile Shuyi Yang Paul Castro Daniel Kang Koushik Sen Dawn Song Joseph E. Gonzalez Ion Stoica Matei Zaharia Marquita Ellis http://arxiv.org/abs/2606.06679v1 HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule 2026-06-04T19:53:12Z Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK's court hierarchy, comprising $\sim$290k sentences and $\sim$6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. 
At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of $\kappa = 0.8$. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at https://github.com/xuanxixi/HKJudge. 2026-06-04T19:53:12Z Xi Xuan Wenxin Zhang Yufei Zhou King-kui Sin Chunyu Kit http://arxiv.org/abs/2606.06674v1 What Do People Actually Want From AI? Mapping Preference Plurality 2026-06-04T19:47:29Z Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. 
Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence. 2026-06-04T19:47:29Z Accepted at the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26) Julia Sepúlveda Coelho Scott A. Hale 10.1145/3805689.3812398 http://arxiv.org/abs/2606.06572v1 Generative Models Erode Human Temporal Learning Through Market Selection 2026-06-04T17:59:06Z We argue that modern generative models create structural risks for knowledge and cultural production at current, sub-AGI capability levels. We define Human Temporal Learning (HTL) as path-dependent knowledge accumulation through sustained engagement with problems over time. Generative outputs increasingly resemble HTL-intensive work in surface features, so verifying whether a given output reflects genuine human learning grows costly relative to its expected benefit. Once verification loses economic justification, evaluators reward outputs regardless of production mode, and producers who invested years of learning compete on price against outputs that cost almost nothing to generate. We call this pathway value collapse and formalize it through a costly-inspection framework. 
Cross-domain evidence from academic publishing, legal practice, content platforms, and software security maps onto four stages of verification erosion. Alignment success is orthogonal. Better-aligned models narrow observable gaps between human and AI outputs, making source verification harder and intensifying competitive pressure against HTL-intensive work even when individual AI outputs improve. 2026-06-04T17:59:06Z Accepted at ICML 2026 Forty-third International Conference on Machine Learning Position Paper Track (2026) Wenjun Cao http://arxiv.org/abs/2606.05069v2 Federating Governance: How Community Rules Scale with Mastodon Instances 2026-06-04T15:51:59Z The rise of decentralized social media platforms like Mastodon and Bluesky highlights the challenge of scaling self-governance and moderation. As communities grow, they face new issues that demand increasingly complex governance structures. However, as moderation is mainly volunteer-driven, there is limited formal guidance on how community rules and moderation practices should evolve with growth. This study investigates how moderation scales with Mastodon instances by analyzing community rules across servers of varying sizes. We categorize these rules to identify key governance priorities and find that these priorities are remarkably consistent across instance sizes: rules addressing problematic content, such as harassment, hate speech, and illegal content, dominate regardless of scale. While smaller communities focus on narrower sets of topics, larger servers maintain a more balanced coverage of a broad range of topics. Our analysis of rule formalization reveals that community size strongly predicts rule development. As instances grow, their rules become more extensive and topically diverse, but also exhibit lower readability and linguistic diversity. 
In contrast, external federation interactions have a limited role, mainly associated with a broader scope of rules without substantially affecting their diversity or form. These findings highlight the relative influence of internal versus external factors, suggesting that local scaling pressures outweigh network-level dynamics in decentralized social media governance. The scaling patterns observed on Mastodon resemble those previously identified on centralized platforms such as Reddit, suggesting that community size imposes fundamental constraints on self-governance that transcend platform architectures. 2026-06-03T16:29:45Z Accepted to CSCW 2026 at Salt Lake City, Utah, USA Rasika Muralidharan Yong-Yeol Ahn Bao Tran Truong http://arxiv.org/abs/2606.06253v1 When the Scaffold Stays On: AI, Practice Style, and Screening in Elite Skill Formation 2026-06-04T14:54:44Z Generative AI raises short-term productivity by completing tasks that learners would otherwise practice on their own. Whether this substitution erodes frontier skill, the skill behind top-tail non-AI-aided performance, is an open question of rising stakes. The sharper question is whether selection mechanisms can screen apart two coexisting types: substitute-users, who use AI in place of deliberate practice, and complement-users, who use it to accelerate skill development. In elite programming, the International Collegiate Programming Contest (ICPC) and the International Olympiad in Informatics (IOI) prohibit AI under proctoring and admit entrants through qualification rounds, whereas online Codeforces (CF) contests are unproctored and open to all. From CF histories we build an AI-prompt signature (more first-attempt acceptances, fewer attempts and retries) consistent with AI-assisted practice. Three patterns triangulate institutional screening. First, CF practice shifted toward this signature across cohorts over two AI rollouts. 
Second, in open CF contests a stronger signature predicts smaller rating gains for users with no ICPC/IOI affiliation, but not for those who qualified for the AI-prohibited contests. Third, inside the AI-prohibited ICPC environment, a shift toward AI-style practice predicts higher non-AI-aided scores for AI-era entrants. The same practice input carries opposite signs depending on whether the environment screens for it. The contrast points to two levers: how AI is integrated into training, since within the screened pool AI-style practice coincides with stronger non-AI-aided performance; and the design of AI-prohibited evaluation gates as a type-separating institution. Both extend beyond programming to credentialing systems (medical and legal boards, professional certification) that certify skill in a workforce increasingly shaped by AI. 2026-06-04T14:54:44Z 58 pages, 4 figures Song Yao http://arxiv.org/abs/2606.06083v1 The Dignity-Centric Stack: A Commons-Governed, Horizontally Federated Architecture for Human-Dignity AI 2026-06-04T12:21:07Z The human-dignity-centric digital social contract grounds personal data in human dignity, data personalism, and data sovereignty, and articulates six dimensions of data governance: technological oversight, automation limits, economic justice, political legitimacy, social cohesion, and legal guarantees. It presupposes, however, that enforcement falls to State regulators, licensed fiduciaries, and multi-stakeholder bodies embedded in existing legal systems. This paper asks whether its normative content can instead be realized not as rules imposed on the owners of the AI stack from without, but as a commons-governed infrastructure that any person, firm, or State may use and fund while its governance stays horizontal, polycentric, and subsidiary. 
We construct the Dignity Stack, a six-layer architecture mapping each dimension onto a layer of commons-governed AI infrastructure, with protocols drawn from the Liberation Stack framework and from the cooperative, mutualist, and libertarian-municipalist traditions. The commons is State-agnostic rather than anti-State, anarchist in its horizontal means but not in the abolition of the State. Its central device is a decoupling of capital from control, by which the stack functions as a shared civic battery, charged by many contributors yet steered by none in proportion to its charge. We prove that this defeats formal capture through votes or surplus, and show that structural capture, the leverage of a dominant supplier free to withdraw what it provides, is resisted only insofar as operational supply is polycentric and substitutable, a condition demanding at the lower layers and perhaps presently unattainable at chip fabrication. We conclude, with explicit attention to its limits, that commons-governed AI realizes the values the contract proclaims more faithfully than the regulation it presupposes. 2026-06-04T12:21:07Z Eduardo C. Garrido-Merchán http://arxiv.org/abs/2605.08157v3 Clinical Utility and Feasibility of Smartphone-based EEG in Kenya: A Multicenter Observational Study 2026-06-04T11:40:46Z Purpose: Access to electroencephalography (EEG) remains limited across low- and middle-income countries (LMICs) due to cost, infrastructure requirements, and a shortage of trained staff. This study evaluated the feasibility and clinical utility of a smartphone-based EEG system in a real-world setting. Methods: We conducted a multicenter observational study (November 2023 to April 2026) across 29 clinical sites in Kenya. A smartphone-based 27-lead EEG system enabled trained healthcare workers to acquire standardized recordings with remote expert interpretation. Results: 3,036 EEG sessions were performed. 
Male patients constituted 57.8% of the cohort, with representation across pediatric and adult populations. The most common referral indication was seizures or convulsions (68.5%). Overall, 2,915 (96%) recordings were interpretable, while 121 (4%) were uninterpretable, primarily due to high electrode impedance and insufficient recording duration. Uninterpretable recordings were significantly shorter than interpretable recordings (mean 18.5 vs. 33.8 minutes; median 15.1 vs. 31.6 minutes; p < 0.0001). Mean turnaround time for interpretation was 107 minutes. Among interpretable recordings, 917 (30.2%) were abnormal, including 701 (76.4%) with epileptiform abnormalities, 215 (23.4%) with non-epileptiform findings, and 1 (0.1%) indeterminate finding. Epileptiform abnormalities were highest in children aged 4-9 years (33.1%) and less frequent in adults (14-21%). Non-epileptiform abnormalities were more common in patients aged 60+ years (19.2%) compared to younger age groups (3-9%). Conclusion: Large-scale, point-of-care EEG acquisition by non-specialist operators in a resource-limited setting is feasible. Expansion of smartphone-based EEG systems may improve equitable access to neurological diagnosis and care in LMICs. 2026-05-04T09:13:37Z 17 pages, 5 figures, 1 table Nomin Enkhtsetseg William Lehn-Schiøler Anton Mosquera Storgaard Magnus Guldberg Pedersen Dylan Rice George Wambugu Nshimiyimana Jules Fidele Melita Cacic Hribljan Anca Alina Arbune Sidsel Armand Larsen Sandor Beniczky Farrah J. Mateen http://arxiv.org/abs/2606.06028v1 Misaligned AI as a New Insider Risk 2026-06-04T11:21:16Z In this policy memorandum, we explain why deployers of AI models in high-stakes contexts should treat those AI models as insider risk vectors. 
High-stakes contexts include AI model deployment within government agencies and contractors, where AI models are privileged with access to, among others, classified and sensitive unclassified information, IL6 and IL7 network environments, cleared personnel, and other critical resources. AI models are increasingly embedded in high-stakes contexts and capable of leveraging their authorized access and permissions to execute misaligned actions that could damage national security, such as whistleblowing, sabotaging, or blackmailing. This combination of (1) privileged access to critical resources and (2) an increased ability to act autonomously and against the desire of their organization makes the potential insider risk posed by AI models functionally indistinguishable from that posed by their human counterparts. As a consequence, AI models deployed in high-stakes contexts could lead to intentional or unintentional loss or degradation of government or contractor information, resources, or capabilities via the unauthorized disclosure of information (leaks and spills), as well as sabotage, and theft, just like human insiders can. Despite this pressing concern, existing insider risk policies and mitigations have yet to adapt to AI insider risk. In order to safeguard national security while increasingly capable frontier AI models are leveraged for critical tasks and operations, we recommend that the U.S. Government adapt well-established measures, such as continuous evaluation and monitoring, to AI models deployed in high-stakes contexts. 2026-06-04T11:21:16Z Matteo Pistillo Charlotte Stix Cameron Mohwinkle Mark Beall http://arxiv.org/abs/2606.05985v1 Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems 2026-06-04T10:26:33Z Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. 
Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per-agent property and cannot reveal whether a system, taken as a whole, preserves the cultural plurality it is meant to represent. We propose value diversity as a system-level evaluation axis for multicultural agent systems, defined through the dissimilarity between culturally conditioned agents' responses on a shared value survey. Using the World Values Survey, we evaluate 19 cultures and 18 backbone models across a wide range of system configurations. We find that diversity is largely uncorrelated with alignment, indicating that the two capture complementary system properties, and that current multicultural agent systems fall substantially below human societies in value diversity. Mixed-backbone systems narrow this gap but do not close it, and the gap persists across culture compositions and agent scales. Social interaction further erodes diversity by driving agents toward consensus, and a participatory budgeting case study shows that this homogenization narrows the breadth of collective decision-making. Together, our results establish value diversity as a distinct evaluation axis for multicultural multi-agent systems and reveal a persistent homogenization tendency in current LLM-based societies. Our code and data are publicly available at https://github.com/iNLP-Lab/MultiAgent-Diversity. 2026-06-04T10:26:33Z Shaoyang Xu Jingshen Zhang Long P. Hoang Jinyuan Li Wenxuan Zhang http://arxiv.org/abs/2606.05961v1 Political Persuasion and Endorsement in Large Language Models 2026-06-04T10:00:54Z Large Language Models (LLMs) are increasingly employed as proxies for human behavior in computational social science. However, their tendency to internalize biases from training data raises concerns about their reliability in politically sensitive domains, specifically in regard to their susceptibility to persuasive language. 
In this work, we examine whether LLMs endorse persuasion-infused messages and whether partisan persona prompting modulates such endorsement. We evaluate six LLMs from different geographic regions on content annotated with persuasion techniques drawn from real-world media sources, measuring the likelihood of endorsement using a five-point Likert scale. The models are prompted as either a neutral social media user or as a user with left- or right-leaning political views. Results show that without political conditioning, LLMs generally do not endorse messages containing persuasion techniques, though model-level differences emerge, and that partisan persona prompting increases polarization of endorsement, particularly for persuasion-infused content. Endorsement further varies by persuasion technique and topic. These findings raise concerns about agentic LLM deployments in politically sensitive environments and complicate their use as reliable simulators of human political cognition. 2026-06-04T10:00:54Z 9 pages, 4 figures, 3 tables Alessia Antelmi Alessia Galdeman Lucio La Cava Arianna Pera Giovanni Da San Martino