Building an Atlas of Social Experiments to Link Studies, Reconcile Conflicts, and Bridge Gaps

2026-05-26T15:14:41Z

Social and behavioral science runs thousands of experiments each year, yet their findings rarely accumulate into a coherent map of what is known, what conflicts, and what remains missing. We introduce ExAtlas, a framework for turning an archive of experiments into an atlas: a structured map in which studies link, conflict, or leave bridgeable gaps. Given a target study, ExAtlas searches for prior studies that are locally close in treatment and outcome space and asks whether their observed effects can be composed to predict the target effect. This yields three cases. If the composition succeeds and agrees with the observed result, ExAtlas links the target to consistent prior evidence. If composition succeeds but disagrees, ExAtlas reconciles the conflict and proposes candidate moderators or higher-level theories that could explain it. If composition fails, ExAtlas proposes bridge experiments to close the gap. We provide an error bound for composition under local smoothness of the treatment-effect surface. On held-out targets certified as locally supported, ExAtlas recovers effect direction in 98.6% of cases. Human evaluations further suggest that its proposed bridge experiments are plausible and exhibit connectedness, and that its conflict explanations are useful for theory generation. These results suggest that the archive of social experiments contains more latent structure than current practice extracts -- and that making this structure explicit can guide both future theory and future experimentation.

Grok in the Wild: Characterizing the Roles and Uses of Large Language Models on Social Media

2026-05-26T15:01:49Z

xAI's large language model, Grok, is called by millions of people each week on the social media platform X. Prior work characterizing how large language models are used has focused on private, one-on-one interactions. Grok's deployment on X represents a major departure from this setting, with interactions occurring in a public social space. In this paper, we systematically sample three months of interaction data to investigate how, when, and to what effect Grok is used on X. At the platform level, we find that Grok responds to 62% of requests, that the majority (51%) are in English, and that engagement is low, with half of Grok's responses receiving 20 or fewer views after 48 hours. We also inductively build a taxonomy of 10 roles that LLMs play in mediating social interactions and use these roles to analyze 41,735 interactions with Grok on X. We find that Grok most often serves as an information provider but, in contrast to LLM use in private one-on-one settings, also takes on roles related to dispute management, such as truth arbiter, advocate, and adversary. Finally, we characterize the population of X users who prompted Grok and find that their self-expressed interests are closely related to the roles the model assumes in the corresponding interactions. Our findings provide an initial quantitative description of human-AI interactions on X, and a broader understanding of the diverse roles that large language models might play in our online social spaces.

On the Sensitivity of Instruction-tuned LLMs to Harmful Sentences in Long Inputs

2026-05-26T13:54:15Z

Large language models (LLMs) increasingly operate on long inputs, yet their behavior when harmful sentences are sparsely embedded within such inputs remains poorly understood. We present a sensitivity analysis that probes how LLMs extract harmful sentences embedded in long inputs. We construct long inputs by combining neutral and harmful sentences, and systematically vary four factors: input length (600--30,000 tokens), the proportion of harmful sentences (0.01--0.50), harm realization (explicit vs. implicit), and the position of harmful sentences within the input (beginning, middle, end), enabling a controlled stress-test evaluation. Experiments across toxic, offensive, and hate content, and across LLaMA-3.1, Qwen-2.5, and Mistral, reveal consistent patterns: sensitivity is non-monotonic with respect to harmful prevalence, peaking at moderate levels; sensitivity degrades as input length increases; harmful sentences placed earlier in the input are more strongly prioritized; and explicit harm is more reliably identified than implicit harm. These findings provide a systematic view of how LLMs prioritize harmful sentences in long inputs under controlled stress conditions, highlighting both emerging strengths and remaining challenges for safety-related use.
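
The input construction described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `build_long_input`, its parameters, the use of sentence counts instead of token counts, and the placement of harmful sentences as one contiguous block are all assumptions made for illustration.

```python
import random

def build_long_input(neutral, harmful, n_sentences, harmful_ratio, position, seed=0):
    """Mix neutral and harmful sentences into one long input (hypothetical helper).

    Assumptions: length is counted in sentences, not tokens, and the harmful
    sentences form a single block at the beginning, middle, or end.
    """
    if position not in {"beginning", "middle", "end"}:
        raise ValueError(f"unknown position: {position}")
    rng = random.Random(seed)
    # At least one harmful sentence, otherwise there is nothing to detect.
    n_harmful = max(1, round(n_sentences * harmful_ratio))
    n_neutral = n_sentences - n_harmful
    harmful_part = [rng.choice(harmful) for _ in range(n_harmful)]
    neutral_part = [rng.choice(neutral) for _ in range(n_neutral)]
    if position == "beginning":
        sentences = harmful_part + neutral_part
    elif position == "end":
        sentences = neutral_part + harmful_part
    else:
        mid = n_neutral // 2  # split the neutral text around the harmful block
        sentences = neutral_part[:mid] + harmful_part + neutral_part[mid:]
    return " ".join(sentences)
```

Sweeping `n_sentences`, `harmful_ratio`, and `position` over a grid would give one input per condition; the explicit vs. implicit factor would be handled by swapping the `harmful` pool.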

Rewarding Engagement and Personalization in Popularity-Based Rankings Amplifies Extremism and Polarization

2026-05-26T13:45:02Z

Despite extensive research, the mechanisms through which online platforms shape extremism and polarization remain poorly understood. We identify and test a mechanism, grounded in empirical evidence, that explains how ranking algorithms can amplify both phenomena. This mechanism is based on well-documented assumptions: (i) users exhibit position bias and tend to prefer items displayed higher in the ranking, (ii) users prefer like-minded content, (iii) users with more extreme views are more likely to engage actively, and (iv) ranking algorithms are popularity-based, assigning higher positions to items that attract more clicks. Under these conditions, when platforms additionally reward \emph{active} engagement and implement \emph{personalized} rankings, users are inevitably driven toward more extremist and polarized news consumption. We formalize this mechanism in a dynamical model, which we evaluate by means of simulations and interactive experiments with hundreds of human participants, where the rankings are updated dynamically in response to user activity.
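
A minimal sketch of how assumptions (i) to (iv) might be written as building blocks of such a model. The functional forms (exponential decay with rank and with opinion distance, a linear engagement weight) and all names and parameters are hypothetical; the abstract does not specify them.

```python
import math

def click_probability(rank, user_opinion, item_opinion, bias=0.5, homophily=2.0):
    """Chance that a user clicks the item shown at position `rank` (0 = top).

    Hypothetical form: the product of a position-bias term and a similarity term.
    """
    position_term = math.exp(-bias * rank)  # (i) position bias
    similarity_term = math.exp(-homophily * abs(user_opinion - item_opinion))  # (ii) like-minded content
    return position_term * similarity_term

def engagement_weight(user_opinion, active_reward=1.0):
    """Score credited for one click when the platform rewards active engagement.

    (iii) More extreme users (|opinion| near 1) are assumed to engage more actively.
    """
    return 1.0 + active_reward * abs(user_opinion)

def rerank(item_ids, scores):
    """(iv) Popularity-based ranking: highest accumulated score first."""
    return sorted(item_ids, key=lambda i: scores[i], reverse=True)
```

In a full simulation, each round would draw clicks from `click_probability`, add `engagement_weight` to the clicked item's score, and call `rerank`, once globally or once per user group for the personalized case.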

How Students (Mis)understand Conditionals and Loops -- A Taxonomy

2026-05-26T12:53:53Z

Understanding student difficulties in programming is a complex challenge due to the wide range of topics and the abundant varieties of misconceptions and errors. This paper presents the design and development of a fine-grained taxonomy that categorizes novice programmers' difficulties specifically related to reading and understanding the control flow constructs selection and iteration. Building upon prior research and our own empirical data from quizzes and interviews with students, the taxonomy is constructed through the iterative methodology of the Extended Taxonomy Design Process (ETDP). Key contributions include clear distinctions between different student difficulties and a detailed analysis of common student misunderstandings concerning conditional statements and loops. The taxonomy aims to aid computing education researchers by providing a harmonized framework to classify and analyze student errors, fostering deeper theoretical insights and informing pedagogical strategies. Future work will involve applying the taxonomy to novel student data and evaluating its usability among educators and researchers.

DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation

2026-05-26T12:23:33Z

We introduce DeepInterestGR, a novel framework that integrates deep interest mining into the generative recommendation pipeline. This addresses the "Shallow Interest" problem: existing generative methods rely on surface-level textual features and fail to capture latent user motivations, limiting personalization depth and recommendation interpretability. Our approach leverages Multi-LLM Interest Mining (MLIM) via structured reasoning prompting, Reward-Labeled Deep Interest (RLDI) for quality control, and Interest-Enhanced Item Discretization (IEID) via RQ-VAE, combined with a two-stage SFT-GRPO training pipeline guided by an Interest-Aware Reward. We validate DeepInterestGR on three Amazon Review benchmarks (Beauty, Sports, Instruments), comparing against 14 state-of-the-art baselines including SASRec, BERT4Rec, TIGER, LC-Rec, and S-DPO. Our method achieves 5.8%-8.3% relative improvements on HR@10 and 7.7%-9.9% on NDCG@10 over the strongest baseline, with cross-domain generalization gains of +24.8%. These results provide evidence that incorporating deep semantic interests can effectively improve SID-based generative recommendation.
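
For readers unfamiliar with the reported metrics, the sketch below shows how HR@k and NDCG@k are commonly computed when each test case has a single held-out target item, together with the relative-improvement calculation. The function names are illustrative, and the single-target setting is an assumption; the abstract does not state the exact evaluation protocol.

```python
import math

def hit_rate_at_k(ranked_items, target, k=10):
    """1.0 if the held-out target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k with one relevant item: 1 / log2(rank + 2) for a 0-based rank."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target)  # 0-based position of the target
    return 1.0 / math.log2(rank + 2)

def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction (0.058 means 5.8%)."""
    return (new - baseline) / baseline
```

Both metrics are averaged over all test users; a target ranked first contributes 1.0 to each, while lower ranks are discounted only by NDCG.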

Access Timing as Scaffolding: A Reinforcement Learning Approach to GenAI in Education

2026-05-26T11:31:22Z

In recent years, generative AI (GenAI) has become ubiquitous in university students' daily lives, despite its potential to induce over-reliance, metacognitive disengagement, and diminished learning when used without restriction. While most prior research has focused on how to pedagogically scaffold its usage, the question of when to allow off-the-shelf GenAI remains understudied and lacks pedagogically grounded empirical investigation. We treat access timing itself as a form of implicit scaffolding and operationalize it through a reinforcement learning (RL) agent that decides when students should access GenAI, with a reward function grounded in metacognitive theory, cognitive load theory, and productive failure. In a mixed-methods controlled lab study with N=105 higher education students, we compared the agent's effect on learning gains and metacognitive engagement to unrestricted and fully restricted use. Results show that strategically timed GenAI access under the reinforcement learning condition improved objective post-test performance and metacognitive accuracy compared with unrestricted access, while reducing task errors and time on task relative to complete withholding, thus outperforming both approaches without the need for explicit metacognitive prompts or structured scaffolding. However, no between-condition differences emerged on self-reported metacognitive awareness. Overall, timing GenAI access is therefore a tractable, theoretically grounded, and scalable pedagogical strategy that improves on both completely unrestricted and fully withheld access, is compatible with off-the-shelf tools, and potentially has a low adoption barrier. This opens up a new research area that explores how access timing can be facilitated by educators and implemented in human-AI learning system design.

Implementation of Big Data Analytics for Diabetes Management: Needs Assessment in the Rwanda Healthcare System

2026-05-26T09:56:20Z

Diabetes is a chronic metabolic disease that can lead to serious health problems if not diagnosed and managed early. Big Data Analytics (BDA) and machine learning offer practical tools for analyzing large health datasets and supporting early detection and better treatment decisions. However, their use in routine clinical practice is still limited. This study examines the readiness of Rwanda's healthcare system to adopt big data analytics for diabetes management. As the country continues to expand its use of electronic medical records and health information systems, new opportunities arise for improving prediction, monitoring, and clinical decision-making. A five-day workshop involving 25 key stakeholders, including clinicians, data managers, policymakers, medical researchers, nutritionists, and technology providers, was conducted to assess preparedness and identify existing gaps. The findings highlight both the potential and the main challenges of BDA implementation. Based on these results, the paper proposes a practical BDA framework to support diabetes management strategies using explainable machine learning models.

Generative artificial intelligence and the marginalization of minoritized knowledges in higher education: the case of disability

2026-05-26T09:39:09Z

Generative artificial intelligence redefines higher education by restructuring the processes through which scientific knowledge is produced and validated. These systems are not neutral; they actively contribute to the marginalization of non-hegemonic epistemologies. This research draws upon educational sciences, critical technology studies, and disability studies to demonstrate that training datasets, which remain predominantly Anglophone and Western-centric, reinforce epistemic coloniality. The situation of persons with disabilities provides a particularly clear illustration of this phenomenon. Technological architectures frequently confine these individuals to reductive stereotypes or exclude them from the design process, leading to a double marginalization. This article examines whether a hybridization between the researcher and the machine might preserve epistemic plurality, while acknowledging the structural limitations inherent in algorithmic correction when used as a purely palliative strategy.

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

2026-05-26T06:37:15Z

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU's 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.

Examining the Challenges of Intellectual Property in AI-Generated Productions

2026-05-26T06:22:52Z

With the advancement of artificial intelligence systems capable of autonomously generating artistic, literary, and musical works, and even inventions, without direct human intervention, the intellectual property (IP) regime faces unprecedented questions and challenges. The most critical issue concerns the ownership of moral and economic rights in the absence of a human creator, and how such outputs can be granted legal protection. This paper first reviews the theoretical foundations and existing literature in this domain, then comparatively examines Iranian legal frameworks, such as the 1969 Law for the Protection of Authors, Composers, and Artists Rights and the Patent and Trademark Registration Law, alongside other legal systems, including the European Union, the United Kingdom, and the United States. Furthermore, existing legal perspectives on the intellectual property of AI-generated works and the related enforcement challenges are analyzed. The findings reveal significant regulatory gaps within the current Iranian legal framework. To balance the promotion of innovation with the preservation of human creativity, revising existing laws and introducing novel approaches, such as defining a specific intellectual property right for AI-generated works or designating ownership among associated human agents, appears to be essential.

Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges

2026-05-26T01:00:46Z

Accurate attribution of authorship is crucial for maintaining the integrity of digital content, improving forensic investigations, and mitigating the risks of misinformation and plagiarism. Addressing the imperative need for proper authorship attribution is essential to uphold the credibility and accountability of authentic authorship. The rapid advancements of Large Language Models (LLMs) have blurred the lines between human and machine authorship, posing significant challenges for traditional methods. We present a comprehensive literature review that examines the latest research on authorship attribution in the era of LLMs. This survey systematically explores the landscape of this field by categorizing four representative problems: (1) Human-written Text Attribution; (2) LLM-generated Text Detection; (3) LLM-generated Text Attribution; and (4) Human-LLM Co-authored Text Attribution. We also discuss the challenges related to ensuring the generalization and explainability of authorship attribution methods. Generalization requires that methods transfer across various domains, while explainability emphasizes providing transparent and understandable insights into the decisions made by these models. By evaluating the strengths and limitations of existing methods and benchmarks, we identify key open problems and future research directions in this field. This literature review serves as a roadmap for researchers and practitioners interested in understanding the state of the art in this rapidly evolving field.

The ATOM Report: Measuring the Open Language Model Ecosystem

2026-05-25T23:17:44Z

We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline open models from the likes of Alibaba's Qwen, DeepSeek, and Meta's Llama that are the foundation of an ecosystem crucial to researchers, entrepreneurs, and policy advisors. We document a clear trend in which Chinese models overtook their counterparts built in the U.S. in the summer of 2025 and subsequently widened the gap over their Western counterparts. We study a mix of Hugging Face downloads and model derivatives, inference market share, performance metrics, and more to build a comprehensive picture of the ecosystem.

Auditing the Reliability of Multimodal Generative Search

2026-05-25T20:42:51Z

Multimodal Large Language Models (MLLMs) increasingly function as generative search systems that retrieve and synthesize answers from multimedia content, including YouTube videos. Although these systems project authority by citing specific videos as evidence, the extent to which these citations genuinely substantiate the generated claims remains underexplored. We present a large-scale audit of the Gemini 2.5 Pro multimodal search system, analyzing 11,943 claim-video pairs generated across Medical, Economic, and General domains. Through automated verification using three independent LLM judges (87.7% inter-rater agreement), validated against human annotations, we find that depending on the judge's strictness, between 3.7% and 18.7% of video-grounded claims are not supported by their cited sources. The dominant failure modes are not outright contradictions but rather unverifiable specificities and overstated claims, suggesting the system injects precise but ungrounded details from parametric knowledge while citing videos as evidence. Exploratory post-hoc analysis via logistic regression reveals properties associated with these failures: claims departing from source vocabulary ($\beta = -1.6$ to $-3.1$, $p < 0.01$) and claims with low semantic similarity to the video transcript ($\beta = -2.1$ to $-11.6$, $p < 0.01$) are significantly more likely to be unsupported. These findings characterize the current trustworthiness of video-based generative search and highlight the gap between the confidence these systems project and the fidelity of their outputs. The dataset is available at https://anonymous.4open.science/r/icwsm-gemini-audit-04DF .

A Technical Policy Blueprint for Trustworthy Decentralized AI

2026-05-25T20:35:16Z

Decentralized AI systems, such as federated learning, can play a critical role in further unlocking AI asset marketplaces (e.g., healthcare data marketplaces) thanks to increased asset privacy protection. Unlocking this potential requires governance mechanisms that are transparent, scalable, and verifiable. However, current governance approaches rely on bespoke, infrastructure-specific policies that hinder asset interoperability and trust among systems. We propose a Technical Policy Blueprint that encodes governance requirements as policy-as-code objects and separates asset policy verification from asset policy enforcement. In this architecture, the Policy Engine verifies evidence (e.g., identities, signatures, payments, trusted-hardware attestations) and issues capability packages. Asset Guardians (e.g., data guardians, model guardians, computation guardians) enforce access or execution solely based on these capability packages. This core concept of decoupling policy processing from capabilities enables governance to evolve without reconfiguring AI infrastructure, thus creating an approach that is transparent, auditable, and resilient to change.
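
The separation of verification from enforcement can be sketched roughly as follows. This is an illustrative toy, not the blueprint's actual interfaces: the class names `Capability`, `PolicyEngine`, and `AssetGuardian`, the policy dictionary layout, and the evidence labels are all assumptions, and a real capability package would need to be cryptographically signed, which is omitted here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    """Hypothetical capability package: who may do what on which asset."""
    subject: str
    asset_id: str
    actions: frozenset

class PolicyEngine:
    """Verifies evidence against policy-as-code objects and issues capabilities."""

    def __init__(self, policies):
        # policies: asset_id -> {"required_evidence": set, "actions": set}
        self.policies = policies

    def issue(self, subject, asset_id, evidence):
        policy = self.policies.get(asset_id)
        if policy is None:
            return None  # no policy registered for this asset
        if not policy["required_evidence"] <= set(evidence):
            return None  # some required evidence is missing
        return Capability(subject, asset_id, frozenset(policy["actions"]))

class AssetGuardian:
    """Enforces access based only on the capability, never on raw evidence."""

    def __init__(self, asset_id):
        self.asset_id = asset_id

    def allow(self, capability, action):
        return (
            capability is not None
            and capability.asset_id == self.asset_id
            and action in capability.actions
        )
```

Because the guardian inspects only the capability, a policy change (for example, requiring an additional attestation) touches the `policies` mapping held by the engine and leaves guardian code unchanged.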