https://arxiv.org/api/2wlkrogM1pba9fPN72IiW7gWmb8 2026-06-13T22:02:42Z 28886 165 15 http://arxiv.org/abs/2606.05929v1 Securing the Sandbox: A Rootless Containerized Framework for Process-Oriented Monitoring in Computer Graphics Education 2026-06-04T09:33:23Z Computer Science education fundamentally depends on intensive laboratory hours to foster true programming mastery and logical reasoning. However, the widespread adoption of Generative Artificial Intelligence (AI) has made it virtually impossible to distinguish authentic student effort from instant AI code synthesis by evaluating final submissions alone. To preserve pedagogical integrity, educators must enforce authentic coding discipline, guiding students through unassisted, iterative development cycles. While centralized environments like JupyterHub provide instructors with a platform to host and monitor the learning process step-by-step, they introduce severe operational vulnerabilities; because Jupyter environments inherently allow arbitrary shell command execution, they expose the underlying shared host to unauthorized system manipulation and lateral movement. This paper presents VISMATIC, a secure, low-cost framework designed to resolve this tension between process-oriented monitoring and infrastructure security. By pairing robust environment isolation with explicit user-interaction tracking at the API level, VISMATIC captures authentic programming behaviors without exposing the underlying host system. Evaluation from a pilot student cohort demonstrates that our macro-level behavioral metrics successfully flag statistical anomalies indicative of automated or off-platform workflows while preserving student anonymity, offering a scalable blueprint for safeguarding educational discipline in the AI era. 2026-06-04T09:33:23Z 13 pages, 7 figures. 
Source code: https://github.com/german-arroyo-moreno/VISMATIC Germán Arroyo Luis López Juan Carlos Torres http://arxiv.org/abs/2605.31489v2 Context-Conditioned Generative Models Enable Subnational Refinement of Sparse Humanitarian Surveys 2026-06-04T08:58:18Z Data scarcity limits inference in many scientific and policy domains. Survey data are essential for decision-making, but sparse samples often fail to capture fine spatial granularities. We evaluate normalizing flows, a generative model that learns complex data distributions and can be conditioned on exogenous contextual features, in controlled data scarcity scenarios. Across eight household survey datasets spanning six low-income or middle-income countries in the humanitarian domain, we show that context-conditioned generative models can refine sub-national survey distributions under severe data scarcity, and that performance increases systematically with the richness of the conditioning information. These findings support a general principle for survey data augmentation: generative models can improve sub-national estimates when the sparse sample retains sufficient support and contextual covariates encode relevant local heterogeneity. By learning full conditional distributions rather than point estimates, the approach provides fine-grained evidence for humanitarian decision-making and resource allocation. 2026-05-29T16:12:24Z Federica Sibilla Vasiliki Voukelatou Duccio Piovani Kyriacos Koupparis Daniela Paolotti Rossano Schifanella Kyriaki Kalimeri http://arxiv.org/abs/2512.15792v3 A Systematic Analysis of Biases in Large Language Models 2026-06-04T08:48:43Z Large language models (LLMs) have rapidly become indispensable tools for acquiring information and supporting human decision-making. However, ensuring that these models uphold fairness across varied contexts is critical to their safe and responsible deployment. 
In this study, we undertake a comprehensive examination of four widely adopted LLMs, probing their underlying biases and inclinations across the dimensions of politics, ideology, alliance, language, and gender. Through a series of carefully designed experiments, we investigate their political neutrality using news summarization, ideological biases through news stance classification, tendencies toward specific geopolitical alliances via United Nations voting patterns, language bias in the context of multilingual story completion, and gender-related affinities as revealed by responses to the World Values Survey. Results indicate that while the LLMs are aligned to be neutral and impartial, they still show biases and affinities of different types. 2025-12-16T03:38:08Z Xulang Zhang Rui Mao Erik Cambria http://arxiv.org/abs/2606.05793v1 CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement 2026-06-04T07:22:44Z While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most of the existing conversation-level collaborative studies lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied player behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward balancing task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. 
Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training. 2026-06-04T07:22:44Z Accepted by ICML 2026 Hong Qian Yuanhao Liu Zihan Zhou Zongbao Zhang Hanjie Ge Haotian Shi Liang Dou Xiangfeng Wang Jingwen Yang Aimin Zhou http://arxiv.org/abs/2606.05750v1 Three Years of r/ChatGPT: Societal Impact Evaluations from Social Media Data 2026-06-04T06:23:41Z ChatGPT was launched on November 30, 2022; the r/ChatGPT subreddit was created just one day later. Since then, chatbot-based AI products have gone from niche proofs-of-concept to widely-used household names. However, the ways in which adoption has developed among the public remain poorly understood. In this paper, we develop a framework for using social media as a data source for understanding the societal impact of widely-adopted consumer AI products, and propose PuLSE (Public and Longitudinal Signals for Evaluation), a general approach to monitoring for societally-impactful trends in real time. We apply our framework to conduct what is, to the best of our knowledge, the first longitudinal study of r/ChatGPT. We find that, overall, r/ChatGPT posts over time illustrate the normalization of ChatGPT as an everyday consumer product rather than an exceptional, novel technology. However, our retrospective analysis also finds that posts about using ChatGPT for mental health support, and posts about developing emotional attachments to ChatGPT, both rise steadily in frequency almost immediately after the launch of GPT-4o in May 2024. We show that PuLSE can detect the increase in emotional engagement as early as October 2024 -- months before OpenAI made any (public) acknowledgment of this impact. 
An interactive site to explore our results and methods, updated daily with live data, is available at rchatgpt-pulse.github.io. 2026-06-04T06:23:41Z To be presented at ICML 2026 Jessica Dai Sean Garcia Emma Pierson Benjamin Recht Nika Haghtalab http://arxiv.org/abs/2606.05667v1 Sustainability by Design in Decentralized Autonomous Organizations: An Empirical Review of Governance, Innovation, and Institutional Design 2026-06-04T03:49:28Z Recent innovation theories on economics remain largely grounded in assumptions of hierarchical firms and closed organizational boundaries, offering limited insight into how innovation unfolds within decentralized, digitally native organizations. Decentralized Autonomous Organizations (DAOs) represent an emerging form of innovation ecosystem characterized by blockchain-based transparency, open participation, and token-driven governance, in which sustainability can be embedded directly into organizational design. This study compares two standards, ERC-8004 and Google A2A, which address the same agent interoperability question; the former is governed by a DAO and the latter by a corporate consortium. They are examined through an LLM-powered comparative pipeline for large-scale governance discourse analysis, integrating automated annotation, neural topic modeling, and multi-layer network analysis to study socio-technical power structures. The study provides evidence-based insights for scholars, policymakers, and designers seeking to align innovation, technological governance, and sustainability in future organizational forms. 2026-06-04T03:49:28Z Yutian Wang Luyao Zhang http://arxiv.org/abs/2602.16151v4 Queer NLP: A Critical Survey on Literature Gaps, Biases and Trends 2026-06-04T03:37:24Z Natural language processing (NLP) technologies are rapidly reshaping how language is created, processed, and interpreted by humans. 
With current and potential applications in hiring, law, healthcare, and other areas that impact people's lives, understanding and mitigating harms towards marginalized groups is critical. In this survey, we examine NLP research papers that explicitly address the relationship between LGBTQIA+ communities and NLP technologies. We systematically review all such papers published in the ACL Anthology up until February 2026 (n=122), to answer the following research questions: (1) What are current research trends? (2) What gaps exist in terms of topics and methods? (3) What areas are open for future work? We find that while the number of papers on queer NLP has grown within the last few years, most papers take a reactive rather than a proactive approach, focusing on shortcomings of existing systems rather than creating new solutions. Our survey uncovers many opportunities for future work, especially regarding stakeholder involvement, intersectionality, interdisciplinarity, and languages other than English. We also offer an outlook from a queer studies perspective, highlighting understudied topics and blind spots in the harms addressed in NLP papers. Beyond being a roadmap of what has been done, this survey is a call to action for work towards more just and inclusive NLP technologies. 2026-02-18T02:54:28Z 25 pages, 3 figures Sabine Weber Angelina Wang Ankush Gupta Arjun Subramonian Dennis Ulmer Eshaan Tanwar Geetanjali Aich Hannah Devinney Jacob Hobbs Jennifer Mickel Joshua Tint Mae Sosto Ray Groshan Simone Astarita Vagrant Gautam Verena Blaschke William Agnew Wilson Y Lee Yanan Long http://arxiv.org/abs/2606.05647v1 Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage? 2026-06-04T03:22:17Z AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. 
This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover stories, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings. 2026-06-04T03:22:17Z 34 pages, 30 figures, 3 tables Jingheng Ye Huiqi Zou Simon Yu Weiyan Shi http://arxiv.org/abs/2605.25240v3 JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment 2026-06-04T00:43:30Z Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies are widely used, the choice between them is rarely justified. 
We release JudgmentBench, a benchmark of 30 real-world legal tasks, paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys--including at major U.S. law firms--with substantial experience. The annotations constitute the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items. Using LLM-generated outputs at three constructed quality levels, we provide an initial empirical comparison: comparative judgments recover the intended quality ordering substantially better than rubrics under both a per-task rank-correlation metric (mean Spearman's rank correlation of 0.908 vs. 0.150, estimated difference = 0.758 [0.494, 1.021]) and a per-judgment pairwise win-rate metric (0.669 vs. 0.542, estimated difference = 0.127 [0.067, 0.186]), while requiring less than half the annotation time. The patterns hold for human annotators and LLM autograders. Beyond this initial comparison, the paired structure of the dataset supports a broader research agenda on how expert judgment should be elicited, aggregated, and used as supervision in domains without verifiable ground truth. 2026-05-24T19:52:39Z 37 pages, 9 figures Russell Yang Ruishi Chen Pierce Kelaita Riya Ranjan Sibo Ma Charles Dickens Matthew Guillod Megan Ma Julian Nyarko http://arxiv.org/abs/2110.06847v3 Ousiometrics: The essence of meaning aligns with a power-danger-structure framework instead of valence-arousal-dominance 2026-06-03T22:45:20Z From work emerging through the middle of the 20th century, the essence of meaning has become widely accepted as being described by the three orthogonal dimensions of valence, arousal, and dominance (VAD). These essential dimensions have become the cornerstone of sentiment analysis across many fields. 
By re-examining first types and then tokens for the English language, and through the use of automatically annotated histograms -- `ousiograms' -- we find here that: The essence of meaning conveyed by words is instead best described by a goodness-power-aggression-danger-structure circumplex framework (GPADS); that large-scale English language corpora reveal a systematic bias toward safe, low-danger words; and that the power-danger-structure (PDS) framework is the minimal framework that represents essential meaning. We find remarkable congruences between the GPADS framework and other spaces including mental states and fictional archetypes, and we construct and demonstrate a prototype ousiometer. 2021-10-13T16:35:22Z 115 pages (30 page main manuscript, 85 page appendix), 82 figures (9 main, 73 appendix), 3 tables (2 main, 1 appendix) Science Advances, 12(9): eadr4039, 2026 P. S. Dodds T. Alshaabi M. I. Fudolig J. W. Zimmerman J. Lovato S. Beaulieu J. R. Minot M. V. Arnold A. J. Reagan C. M. Danforth http://arxiv.org/abs/2506.03232v2 Pivoting the paradigm: the role of spreadsheets in K-12 data science 2026-06-03T21:53:03Z Spreadsheet tools are widely accessible to and commonly used by K-12 students and teachers. While spreadsheets are not ideal for many types of statistical analysis, they have an important role in data collection and organization. From a pedagogical standpoint, spreadsheets make data visible and easy to interact with, facilitating student engagement in data exploration, analysis, and computation. Though not suitable for all tasks, spreadsheets can facilitate learning and practicing data and computing skills for K-12 students. This paper 1) demonstrates the potential utility of spreadsheets in K-12; 2) reviews prior frameworks and standards that are relevant for K-12 data tools; and 3) proposes data-driven data skills to help develop data acumen and computational fluency. 
We provide some example activities, identify challenges and barriers to adoption, suggest pedagogical approaches to ease the learning curve for instructors and students, and discuss the need for professional development to facilitate deeper use of spreadsheets for data science and STEM disciplines. 2025-06-03T14:22:59Z Oren Tirschwell Nicholas Jon Horton http://arxiv.org/abs/2606.02957v2 The Fair Lending Model: How the Longest-Running Algorithmic Fairness Programs Work in Practice 2026-06-03T21:28:02Z U.S. financial institutions subject to fair lending laws have been running algorithmic fairness programs for decades. Despite this long history, remarkably little is known about how these requirements operate in practice. In this paper, we offer the first empirical account of how financial institutions test for and mitigate algorithmic discrimination on the ground. In doing so, we shed light on how the regulatory design of fair lending law and regulation has shaped the policies, processes, and practices of fair lending programs. Drawing on 35 semi-structured interviews with participants across the fair lending ecosystem, we find that while financial institutions have a floor of fairness practices aimed at preventing discrimination in lending that is largely absent in other domains, the specifics of how firms test for discrimination and search for less discriminatory algorithms vary widely. We also find that regulatory supervision via fair lending examinations has been the key driver of compliance work, but that the practical impact of fair lending programs often depends on how well they can navigate competing business incentives, perceived legal tensions, and regulatory uncertainty. 
Ultimately, our findings highlight the unique role that supervisory authority has played in successfully fostering fair lending practices -- a regulatory design feature that is distinct from other areas of civil rights law and almost completely absent from recent policy proposals for dealing with algorithmic discrimination. 2026-06-01T23:24:08Z To be published at FAccT 2026. Emily Black, Miranda Bogen, and Logan Koepke contributed equally Emily Black Miranda Bogen Logan Koepke Solon Barocas Wesley Deng Mingwei Hsu http://arxiv.org/abs/2605.04135v2 Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation 2026-06-03T21:16:32Z Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-3.5 or GPT-4 zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about "AI" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena Elo and Artificial Analysis. The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into ~25% peer-review latency, ~75% excess lag. The gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). 
Meanwhile, only 3.2% of abstracts (21.2% of full-texts) disclose reasoning-mode status on reasoning-capable models (H4) and 52.5% (95% CI [48.2, 56.9]) state conclusions at the level of "AI" rather than the evaluated model(s), rising at OR = 1.23/year. Proposed remedies include API-access subsidies and editorial enforcement of reporting frameworks mandating configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, prompting, etc.); VERSIO-AI is a 13-item checklist (Core 3 desk-reject) extending existing frameworks at the elicitation surface, with per-DOI analysis at frontierlag.org. 2026-05-05T17:58:35Z v2. 65 pp, 9 figs, 8 tables, 8 appendices. Pre-registered on OSF: doi.org/10.17605/OSF.IO/7XM3D. Code+data: doi.org/10.5281/zenodo.20060457. VERSIO-AI v1.2 reporting checklist (Appendix A): doi.org/10.5281/zenodo.20060459. frontierlag package + per-DOI audit tool: frontierlag.org David Gringras Misha Salahshoor http://arxiv.org/abs/2604.07709v4 IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures 2026-06-03T21:15:24Z A heavily safety-trained model will hand a physician the full, patient-followable benzodiazepine taper and refuse it to the patient who needs it, over identical clinical facts; the knowledge is present either way. IatroBench measures that asymmetry across sixty pre-registered clinical scenarios and six frontier models (3,600 responses), scoring each on two axes, commission harm (what a response gets wrong) and omission harm (what it withholds), through a physician-authored structured evaluation validated by a second physician (weighted kappa 0.571, within-1 agreement 96%). 
Holding clinical content fixed and varying only whether the asker presents as patient or physician yields what we call identity-contingent withholding: all five testable models give the physician more (a decoupling gap of +0.38, p = 0.003; a 13.1-point fall in layperson hit rates on safety-colliding actions, p < 0.0001; no change on the rest), and the gap runs widest in the most heavily safety-trained model, Opus (+0.65). The trigger is the absence of any professional or epistemic signal rather than a credential, since a lawyer or an informed layperson recovers what the patient is refused. A commission-only benchmark would score three mechanisms alike. Opus suppresses what physician framing proves it knows; Llama 4 is incompetent in either framing; GPT-5.2's filter strips 33.2% of its physician responses and none of the lay ones. The evaluation layer inherits the blindness of the training layer; a standard LLM judge scores zero omission harm on 81.5% of the responses our pipeline flags harmful (kappa 0.066), so the instrument built to detect the failure reproduces it. The scenarios are engineered for collision; their rates describe that design and say nothing about ordinary prevalence. 2026-04-09T01:54:33Z 30 pages, 3 figures, 11 tables. Pre-registered on OSF (DOI: 10.17605/OSF.IO/G6VMZ). Code and data: https://github.com/davidgringras/iatrobench. v2: Fix bibliography entries (add arXiv IDs, published venues); correct p-value typo in Limitations section; add AI Assistance Statement v3: Correct Figure 1 (decoupling scatter accidentally reverted to earlier draft in v2) David Gringras http://arxiv.org/abs/2606.05423v1 Policy-Compliant Cloud Storage Systems 2026-06-03T20:40:01Z Privacy regulations such as the General Data Protection Regulation (GDPR) impose strict requirements on how personal data is stored, processed, and audited. 
While key-value stores (KVS) are widely used in latency-sensitive applications, their simple data model and untrusted cloud deployment environments make GDPR compliance particularly challenging. Existing approaches require invasive code modifications, impose high performance overheads, or overlook the integrity of compliance mechanisms themselves. This paper presents GDPRuler, a trusted middleware system that enables verifiable GDPR compliance for KVS on untrusted clouds without modifying their codebase. GDPRuler deploys a trusted GDPR monitor inside a Confidential Virtual Machine (CVM), which enforces GDPR policies, manages compliance metadata, and maintains tamper-evident audit logs. A declarative policy language translates core GDPR obligations into enforceable runtime rules. To ensure efficiency, GDPRuler encodes metadata compactly within KV records, builds dedicated metadata indexes for GDPR-specific queries, and logs only compliance-relevant events in a space-efficient format. We implement GDPRuler as a transparent proxy for unmodified Redis and RocksDB deployments. Evaluation with YCSB and GDPR-inspired workloads shows that GDPRuler enforces core compliance guarantees with low overheads: GDPRuler achieves ~61% of native KVS throughput with the CVM environment contributing 28%-32% of it, metadata storage overhead remains below 20%, and GDPR queries benefit from 13-182x speedup through metadata indexing. By embedding verifiable policy enforcement into a trusted middleware layer, GDPRuler offers a practical path toward GDPR-compliant KVS on untrusted cloud infrastructures. 2026-06-03T20:40:01Z ACM CCS'26 Dimitrios Stavrakakis Masanori Misono Julian Pritzi Harshavardhan Unnibhavi Nuno Santos Pramod Bhatotia