Automated Grading of Handwritten Mathematics Using Vision-Capable LLMs

2026-05-18T19:06:15Z

Automated grading systems have enabled scalable assessment for many response types, but handwritten mathematics remains a barrier due to the complexity of multi-step solutions. Vision-capable large language models (LLMs) offer new opportunities here, yet their reliability in authentic instructional settings remains poorly understood. We present an empirical evaluation of an LLM-based grader for handwritten mathematical work using instructor-defined rubrics. Extending a prior pipeline for typed responses, we integrate transcription and rubric-based evaluation of photographic submissions within a single LLM call, evaluating on student work from two university STEM courses. Comparing AI grading decisions against human-assigned ground truth at the rubric-item level, we observe high overall accuracy, with most errors -- 87\% in the best model -- attributable to transcription failures rather than rubric misapplication. We categorize common error modes, including image quality issues, hallucinated content, and incorrect handling of equivalent expressions. These findings highlight both the promise and limitations of LLM-based grading for handwritten mathematics, providing guidance for system design, prompt refinement, and deployment in educational settings.

The NetMob26 Dataset: A High-Resolution Multi-Source View of Public Bus Mobility in Niterói

2026-05-18T18:30:19Z

The NetMob Data Challenge releases a comprehensive public transportation dataset from Niterói, addressing the lack of high-quality mobility and passenger demand data. Based on operational records from March 2026, the dataset combines four main sources: GPS telemetry from buses, approximately 7.2 million ticketing transactions, auxiliary transit data (routes, stops, and weather), and urban infrastructure and socio-demographic information. Together, these sources provide a detailed view of both transit supply and passenger demand. The data were preprocessed, cleaned, and anonymized to preserve privacy and improve reliability, including the removal of operational inconsistencies and anonymization of passenger identifiers. Access is restricted to challenge participants who accept the Terms and Conditions and sign an NDA. The paper describes the data collection and preprocessing pipeline, dataset organization, and mobility patterns observed in the system. The dataset supports research on topics such as public transportation efficiency, demand forecasting, accessibility analysis, service reliability, and the influence of external factors like weather on urban mobility.

Programmable Participatory Governance -- A Formal Framework for Transparent, Accountable, and Citizen-Responsive Democratic Systems: From Deliberative Theory to Decentralised Architecture

2026-05-18T17:56:00Z

Public confidence in democratic institutions has declined across many OECD countries over recent decades, while political participation and policy influence remain unevenly distributed across socioeconomic groups. Concurrently, democratic backsliding, declining electoral participation, and persistent concerns regarding institutional transparency and accountability have raised questions about whether existing governance structures are capable of sustaining broad-based legitimacy in complex modern societies. These developments motivate a central institutional design question: can governance systems be restructured to expand participation, improve transparency, and strengthen accountability without undermining stability or decision quality? This thesis proposes Programmable Participatory Governance (PPG), a formal governance framework designed to address these institutional deficits through the integration of democratic theory, institutional economics, and cryptographically verifiable distributed systems. PPG synthesises insights from deliberative and participatory democracy, collective action theory, direct democratic governance, and distributed computation to define a programmable architecture for transparent, verifiable, and scalable civic coordination. The framework is formally specified and evaluated through simulation and systems-oriented architectural analysis. The thesis examines how programmable governance mechanisms can support participatory decision-making while preserving procedural integrity, auditability, and institutional resilience under conditions of large-scale coordination. The objective is not to replace existing democratic institutions outright, but to explore how computationally mediated governance structures may augment or improve contemporary democratic processes in contexts where conventional institutions exhibit persistent structural limitations.

Generative AI Advertising as a Problem of Trustworthy Commercial Intervention

2026-05-18T17:15:06Z

Major deployed generative AI advertising systems preserve a visible boundary between commercial content and AI-generated responses. Yet empirical research shows that ads woven directly into large language model (LLM) outputs often go undetected by users. We argue that generative AI fundamentally changes advertising: rather than placing products into discrete slots, it enables interventions on the generative process itself, which induce commercial influence through less observable channels. This reframes generative AI advertising as a problem of trustworthy intervention rather than content placement. We introduce a taxonomy organized by influence tier, corresponding to interventions on progressively more latent variables (product mentions, information framing, behavioral redirection, and long-term preference shaping), and show how these tiers instantiate across modalities and system architectures, including retrieval-augmented generation and agentic pipelines where upstream decisions can sharply constrain downstream outcomes. Both major deployed systems and designed mechanisms concentrate on the most observable and easiest-to-govern tier, while the forms of commercial influence most consequential for user autonomy remain poorly understood and lack frameworks for detection, measurement, or disclosure. The central challenge is whether commercial influence in generative systems can be made trustworthy, i.e., attributable, measurable, contestable, and aligned with user welfare.

Revisiting the Regularity of Student Learning Rate: Sensitivity to Which Observations Are Included

2026-05-18T16:28:36Z

Mixed-effects models fit to observational practice data are widely used in learning analytics to estimate student-level variation in initial knowledge and learning rate, and the resulting estimates increasingly inform substantive claims about learners. We examine whether such estimates can be read as properties of learners or whether they depend on choices about which observations the model is fit to. As a case study, we revisit the ``astonishing regularity'' reported by Koedinger et al. (2023): that students vary substantially in initial knowledge but much less in learning rate. The finding is based on fits of the individual Additive Factors Model (iAFM) to 27 educational datasets, and rests on a model-derived estimate of student-level learning-rate variation being small in absolute terms. We refit the same model on the same datasets under two specifications, each varying how much of each student's practice on a given skill is used in fitting. The estimate of student-level variation in initial knowledge stays approximately stable across both specifications. The estimate of student-level variation in learning rate does not: it inflates by a median of 118\% under one specification and is several times larger under the other. The same model, fit to the same data, returns substantially different estimates of how much students vary in learning rate depending on which observations are included. When estimates from mixed-effects models on observational practice data are used to support substantive claims about learners, sensitivity to such choices deserves a central place in how those estimates are reported and read.

Scalable Generation and Validation of Isomorphic Physics Problems with GenAI

2026-05-18T15:31:42Z

Traditional synchronous STEM assessments face growing challenges including accessibility barriers, security concerns from resource-sharing platforms, and limited comparability across institutions. We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations (numeric values, spatial relations) alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models (LMs) (0.6B-32B) and compare against actual student performance (N>200) across three midterm exams. Results show that 73% of deployed banks achieve statistically homogeneous difficulty, and LM performance patterns correlate strongly with student performance (Pearson's $ρ$ up to 0.594). Additionally, LMs successfully identify problematic variants, such as ambiguous problem texts. Model scale also proves critical for effective validation: extremely small (<4B) and large (>14B) models exhibit floor and ceiling effects respectively, making mid-sized models optimal for detecting difficulty outliers.
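
The correlation reported above can be illustrated with a minimal sketch. The abstract does not say how per-item scores were aggregated, so the function name `pearson` and the two score lists below are hypothetical and serve only to show how a Pearson coefficient between LM accuracy and student accuracy on the same items might be computed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-item accuracies (fraction correct); not data from the paper.
lm_accuracy = [0.90, 0.70, 0.80, 0.40]
student_accuracy = [0.85, 0.60, 0.75, 0.50]
print(round(pearson(lm_accuracy, student_accuracy), 3))
```

A coefficient near 1 would mean that items the LMs find hard are also hard for students, which is the property that makes LM performance usable as a pre-deployment difficulty check.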

Generativism: Toward a Learning Theory for the Age of Generative Artificial Intelligence

2026-05-18T15:22:43Z

The four dominant learning theories of behaviorism, cognitivism, constructivism, and connectivism show significant conceptual limitations as generative artificial intelligence (AI) proliferates in educational settings. These frameworks were formulated before the emergence of AI systems capable of generating, synthesizing, and reasoning about knowledge. This article critically examines each learning theory and identifies assumptions challenged by generative AI's affordances. Drawing on research in distributed cognition, extended mind, human-AI collaboration, AI literacy, cognitive offloading, and metacognition, the article proposes Generativism as a learning theory for the generative AI age. Generativism posits that learning increasingly occurs through the iterative co-construction of knowledge between human learners and AI systems. The proposed framework is organized around four principles: epistemic partnership, distributed agency, generative literacy, and adaptive metacognition. The framework offers a foundation for rethinking instructional design, learning, assessment, and expertise development in contexts where generative AI plays an integral role in cognition.

REBAR: Reference Ethical Benchmark for Autonomy Readiness

2026-05-18T13:56:19Z

As autonomous systems grow more advanced, objective metrics to evaluate their ethical and legal compliance are critical for informing end users of their limitations and ensuring accountability of those who misuse them. Current ethical embodied AI frameworks remain mostly qualitative, focusing on system design (through safety guardrails or targeted red teaming), and the realized guardrails often directly disallow unsafe behavior without providing the user with an override or interpretable reason. Instead, there is a need for computable metrics through rigorous testing that allow a user to determine the applicability of the system to the task. To address this gap, we introduce the Reference Ethical Benchmark for Autonomy Readiness (REBAR), a quantitative test and evaluation framework for autonomous systems. REBAR maps operating metrics into a computable Autonomy Readiness Level (ARL) rubric that can quantify ethical performance. Key innovations of the framework include a neuro-symbolic Large Language Model (LLM) approach to calculate and explain the ethical difficulty of scenarios, LLM-driven at-scale generation of test instances, and a versatile, photorealistic simulation environment. By evaluating white-box autonomy solutions through this rigorous testing pipeline, REBAR delivers an objective and repeatable benchmark score, bridging the gap between abstract principles and verifiable, accountable autonomy.

Diagnosing Korean-Language LLM Political Bias via Census-Grounded Agent Simulation

2026-05-18T13:42:23Z

Large language models (LLMs) exhibit systematic political biases in voter simulations, but their underlying mechanisms and cross-lingual generalizations remain poorly understood. We introduce Dynamo-K, a census-grounded simulation framework evaluating Korean-language LLM political behavior across four models on six Korean elections (2017-2025). Using this framework, we identify three systematic failure modes: (1) progressive bias in moderate agents, where explicit mitigation reduces Mean Absolute Error (MAE) by a factor of 5.2; (2) model-dependent third-party salience collapse, distinguishing between salience failure and decision bias; and (3) regional polarization collapse, where models bidirectionally under-predict historical party strongholds. To address these failures, we demonstrate that scenario reframing recovers 62% of 2017 MAE by restoring third-party visibility. Furthermore, we introduce a learned reweighting adapter that successfully calibrates opposing-valence models without relying on candidate names at train or test time. Validating our diagnostic framework, Dynamo-K accurately predicts 3/3 presidential winners (including a 2.1%p MAE on the highly contested 0.73%p-margin 2022 race) and correctly identifies the dominant party in a held-out local election. The pipeline is open-source and provides a scalable, cost-effective method for diagnosing LLM political behavior.
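
To make the MAE figures above concrete, here is a minimal sketch of mean absolute error in percentage points between simulated and observed vote shares. The abstract does not state whether the error is averaged over candidates, regions, or both; averaging over candidates is an assumption here, and the function name and vote shares are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """MAE in percentage points between two vote-share dicts keyed by candidate."""
    if predicted.keys() != actual.keys():
        raise ValueError("predicted and actual must cover the same candidates")
    # Average the absolute gap per candidate (assumed aggregation level).
    return sum(abs(predicted[c] - actual[c]) for c in actual) / len(actual)


# Hypothetical vote shares in percent; not results from the paper.
simulated = {"A": 46.0, "B": 49.0, "Other": 5.0}
observed = {"A": 48.0, "B": 48.5, "Other": 3.5}

mae = mean_absolute_error(simulated, observed)
print(f"MAE = {mae:.2f} %p")  # absolute gaps 2.0, 0.5, 1.5 -> mean 1.33
```

Note that in this toy case the simulation also picks the wrong leader even though the MAE is small, which is why a low MAE on a narrow-margin race is a stricter test than the MAE alone suggests.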

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

2026-05-18T13:29:46Z

Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones >=3.8B, and 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.

Mapping Human Anti-collusion Mechanisms to Multi-agent AI Systems

2026-05-18T12:03:54Z

As multi-agent AI systems become increasingly autonomous, evidence shows they can develop collusive strategies similar to those long observed in human markets and institutions. While human domains have accumulated centuries of anti-collusion mechanisms, it remains unclear how these can be adapted to AI settings. This paper addresses that gap by (i) developing a taxonomy of human anti-collusion mechanisms, including sanctions, leniency & whistleblowing, monitoring & auditing, market design, and governance, and (ii) mapping them to potential interventions for multi-agent AI systems. For each mechanism, we propose implementation approaches. We also highlight open challenges, such as the attribution problem (difficulty attributing emergent coordination to specific agents), identity fluidity (agents being easily forked or modified), the boundary problem (distinguishing beneficial cooperation from harmful collusion), and adversarial adaptation (agents learning to evade detection).

The threat of analytic flexibility in using large language models to simulate human data

2026-05-18T11:22:40Z

Social scientists are now using large language models to create "silicon samples": synthetic datasets intended to stand in for human respondents. However, producing these samples requires many analytic choices, including model selection, sampling parameters, prompt format, and the amount of demographic or contextual information provided. Across two studies, I examine whether these choices materially affect correspondence between silicon samples and human data. In Study 1, I generated 252 silicon-sample configurations for a controlled case study using two social-psychological scales, evaluating whether configurations recovered participant rankings, response distributions, and between-scale correlations. Configurations varied substantially across all three criteria, and configurations that performed well on one dimension often performed poorly on another. In Study 2, I extended this analysis to a published silicon-sample use case by re-examining Argyle et al.'s (2023) Study 3 using 66 alternative configurations. Correlations between human and silicon association structures differed substantially across configurations, from r = .23 to r = .84. Taken together, the results from these studies demonstrate that different defensible configuration choices can materially alter conclusions about the fidelity of silicon samples. I call for greater attention to the threat of analytic flexibility in using silicon samples and outline strategies that researchers may adopt to reduce this threat.

Individual utilities of life satisfaction reveal inequality aversion unrelated to political alignment

2026-05-18T10:56:36Z

How should well-being be prioritised in society, and what trade-offs are people willing to make between fairness and personal well-being? We investigate these questions using a stated preference experiment with a nationally representative UK sample (n = 300), in which participants evaluated life satisfaction outcomes for both themselves and others under conditions of uncertainty. Individual-level utility functions were estimated using an Expected Utility Maximisation (EUM) framework and tested for sensitivity to the overweighting of small probabilities, as characterised by Cumulative Prospect Theory (CPT). A majority of participants displayed concave (risk-averse) utility curves and showed stronger aversion to inequality in societal life satisfaction outcomes than to personal risk. These preferences were unrelated to political alignment, suggesting a shared normative stance on fairness in well-being that cuts across ideological boundaries. The results challenge the use of average life satisfaction as a policy metric, and support the development of nonlinear utility-based alternatives that more accurately reflect collective human values. Implications for public policy, well-being measurement, and the design of value-aligned AI systems are discussed.

Faculty Orientations Shape Adoption of AI in Research and Teaching

2026-05-18T09:47:35Z

Despite the widespread availability of large language models (LLMs) in higher education, instructors vary substantially in their adoption and use of these tools, and the reasons for this variation remain poorly understood. A mixed-methods survey of 90 STEM faculty in the Research Corporation for Science Advancement (RCSA) Cottrell community examined relationships between AI use, attitudes, institutional context, and instructional practice. Exploratory factor analysis identified a coherent construct, \textit{AI pedagogical orientation}, that strongly predicted self-reported AI use across research, teaching, and other professional activities. Qualitative analysis indicated that this construct reflected differing views about the role AI should play in disciplinary thinking, learning, and expertise development, rather than simply positive or negative attitudes toward AI. Institutional initiatives, demographic variables, and information sources showed comparatively weak associations with AI use. The results suggest that existing technology-adoption models may not fully explain adoption in contexts where technologies interact directly with disciplinary reasoning and knowledge production.

The Ephemeral Web and the Case for Proactive Archiving

2026-05-18T04:10:32Z

The web is often treated as a durable record of institutional and social life, yet in practice it is fragile, revisable, and frequently ephemeral. Domains change, redesigns erase earlier material, institutions relocate, maintainers graduate, platforms impose silent limits, and periods of political instability can interrupt digital access entirely. This paper argues that archiving should not remain a niche activity practiced by a few specialists at the margins, but should become a proactive part of website maintenance. I motivate this claim through a case study centered on the Pakistan Embassy International School and College Tehran, whose domain, visual identity, leadership, and physical location all changed within a short period after my graduation. In response, I built and deployed a lightweight automated archival system using Python and GitHub Actions to submit pages and media from the site to the Internet Archive's Wayback Machine. The project shows both that archival preservation can be automated with modest infrastructure and that archival systems are themselves vulnerable to interruption, as illustrated by GitHub's automatic disabling of scheduled workflows after repository inactivity. Drawing on personal experience with internet shutdowns in Iran, open-source sustainability lessons from RPI's RCOS, and the operational history of the archiver, I argue that the ephemerality of the web is not an exception but a structural condition. If digital societies wish to preserve institutional memory and public history without leaving preservation to chance, proactive archiving should become a commonplace part of website maintenance.
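
The kind of submission step described above can be sketched in a few lines of Python. This is not the author's script, which the abstract does not show. It assumes the Wayback Machine's public "Save Page Now" URL form (`https://web.archive.org/save/<page URL>`), and the function names, the `User-Agent` string, and the timeout are hypothetical choices for illustration.

```python
import urllib.parse
import urllib.request

# Assumed public "Save Page Now" prefix; the author's actual endpoint is not stated.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_url(page_url):
    """Return the capture-request URL for a single http(s) page."""
    scheme = urllib.parse.urlsplit(page_url).scheme
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {page_url!r}")
    return SAVE_ENDPOINT + page_url


def submit(page_url, timeout=60):
    """Request a capture and return the HTTP status code (performs network I/O)."""
    request = urllib.request.Request(
        build_save_url(page_url),
        headers={"User-Agent": "archive-sketch/0.1"},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A scheduled job could call `submit` once per page in a site's URL list, spacing requests out to stay within the archive's rate limits. Because scheduled GitHub Actions workflows can be disabled automatically after repository inactivity, as the abstract notes, such a job also needs some independent check that it is still running.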