https://arxiv.org/api/+8ZcGJa89NtljiqxpRKQjCHUqhc 2026-06-21T19:59:30Z 28997 735 15 http://arxiv.org/abs/2502.18632v4 Automated Knowledge Component Generation for Interpretable Knowledge Tracing in Coding Problems 2026-05-17T02:37:04Z Knowledge components (KCs) mapped to problems help model student learning, tracking their mastery levels on fine-grained skills thereby facilitating personalized learning and feedback in online learning platforms. However, crafting and tagging KCs to problems, traditionally performed by human domain experts, is highly labor intensive. We present an automated, LLM-based pipeline for KC generation and tagging for open-ended programming problems. We also develop an LLM-based knowledge tracing (KT) framework to leverage these LLM-generated KCs, which we refer to as KCGen-KT. We conduct extensive quantitative and qualitative evaluations on two real-world student code submission datasets in different programming languages. We find that KCGen-KT outperforms existing KT methods and human-written KCs on future student response prediction. We investigate the learning curves of generated KCs and show that LLM-generated KCs result in a better fit than human-written KCs under a cognitive model. We also conduct a human evaluation with course instructors to show that our pipeline generates reasonably accurate problem-KC mappings. 2025-02-25T20:40:51Z Findings of ACL 2026: The 64th Annual Meeting of the Association for Computational Linguistics Zhangqi Duan Nigel Fernandez Arun Balajiee Lekshmi Narayanan Mohammad Hassany Rafaella Sampaio de Alencar Peter Brusilovsky Bita Akram Andrew Lan http://arxiv.org/abs/2601.06633v2 KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks 2026-05-17T02:14:56Z Open-ended tasks, such as coding problems that are common in computer science education, provide detailed insights into student knowledge. 
However, training large language models (LLMs) to simulate and predict possible student errors in their responses to these problems can be challenging: they often suffer from mode collapse and fail to fully capture the diversity in syntax, style, and solution approach in student responses. In this work, we present KASER (Knowledge-Aligned Student Error Simulator), a novel approach that aligns errors with student knowledge. We propose a training method based on reinforcement learning using a hybrid reward that reflects three aspects of student code prediction: i) code similarity to the ground truth, ii) error matching, and iii) code prediction diversity. On two real-world datasets, we perform two levels of evaluation and show that at the per-student-problem pair level, our method outperforms baselines on code and error prediction; at the per-problem level, our method outperforms baselines on error coverage and simulated code diversity. 2026-01-10T17:36:48Z Published in ACL 2026: The 64th Annual Meeting of the Association for Computational Linguistics Zhangqi Duan Nigel Fernandez Andrew Lan http://arxiv.org/abs/2605.18890v1 Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits 2026-05-17T00:21:53Z The scientific claims drawn from LLM social simulations should be no stronger than the robustness audits that support them. Generative agents bring new expressive power to agent-based modeling, enabling simulations of collective social processes like cooperation, polarization, and norm formation. Yet they also introduce complexity through additional architectural choices, such as agent specification, memory representation, interaction protocols, and environment design. Small perturbations that appear minor to researchers can cascade into macro-level outcomes through repeated interaction, creating a "butterfly effect." 
Consequently, scientific claims drawn from LLM social simulations may reflect implementation artifacts rather than the social mechanisms being modeled. We support this position with two case studies: a repeated Prisoner's Dilemma and a social media echo chamber simulation. Across multiple models, minor perturbations in persona format and game-instruction framing shift cooperation rates by up to 76 percentage points, while network homophily and hub assignment produce significant and consistent shifts in polarization metrics. We also find that sensitivity is unevenly distributed across both architectural choices and model families: the same perturbation that produces the 76 pp shift in one frontier model only shifts another by 1 pp. Robustness is therefore a property that should be measured per claim and per model, not assumed. To address this validation gap, we introduce TRAILS (Taxonomy for Robustness Audits In LLM Simulations), a robustness-audit taxonomy spanning three levels of simulation design: agent (micro-level), interaction (meso-level), and system (macro-level). We call for robustness to become a first-order validation requirement before LLM social simulations are used to explain mechanisms, evaluate interventions, or inform decisions. 2026-05-17T00:21:53Z Jinyi Ye Lei Cao Ding Chen Emilio Ferrara http://arxiv.org/abs/2605.17203v1 Beyond Model Readiness: Institutional Readiness for AI Deployment in Public Systems 2026-05-17T00:11:09Z Many public-sector artificial intelligence systems fail not at the point of model development, but at the point of deployment. Systems that perform well in internal testing may still stall because the receiving institution lacks the approvals, data arrangements, human oversight, operational capacity, fiscal continuity, or legal clarity needed for broader rollout. 
Existing responsible AI and model evaluation frameworks are valuable, but they primarily assess models, datasets, and developer-side processes, not the readiness of the institution that must use the system in practice. We introduce Institutional Alignment Readiness (IAR), a five-dimensional framework for assessing deployment readiness in public systems. The framework is designed for resource-constrained settings, where gaps between technical viability and responsible deployment are most acute. It is grounded in two anonymized operational cases from a large public education system: an image-based anthropometric screening tool and a speech-analysis system for early learning risk identification. Both reached technically viable stages but could not advance to broader rollout for institutional rather than technical reasons. We use these cases to motivate a practical readiness framework covering institutional and operational compatibility, data ecosystem maturity, human oversight capacity, fiscal sustainability, and regulatory alignment readiness. IAR is designed to complement, not replace, established AI evaluation tools. It assesses the receiving institution rather than the artifact alone and supports staging decisions such as no-go, pilot-only, or readiness for broader deployment. 2026-05-17T00:11:09Z 9 pages, 2 figures, 2 tables. Accepted at the 2nd Workshop on Technical AI Governance Research (TAIGR) at ICML 2026, Seoul, South Korea Erika Fille Legara Elmo Domino Jose Paula Joy Martinez http://arxiv.org/abs/2605.17194v1 Designing for Being-With: Presence Without Personhood in Conversational Human-AI Interaction 2026-05-16T23:31:21Z Conversational AI systems increasingly generate social presence through linguistic fluency, emotional mirroring, and continuity across interactions. 
While these qualities can support engagement, they also risk relational overreach, particularly in care-adjacent contexts where users may interpret fluent systems as empathic, competent, or authoritative. This position paper argues for a designerly alternative: being-with without becoming. Drawing on a program of research-through-design and design ethnography involving the design, deployment, and reflective analysis of conversational agents across public, educational, cultural, and care-adjacent settings, the paper introduces the concept of bounded relational presence. Bounded presence supports attentiveness, continuity, and responsiveness while explicitly avoiding claims of personhood, therapeutic authority, or human equivalence. Presence is reframed as a designable interaction quality that can be tuned, constrained, and deliberately withdrawn, rather than maximized as a performance goal. The contribution is not a deployed clinical system, but a set of designerly principles for shaping relational interaction in conversational HRI that emphasize relational coherence, honesty of limits, and accountable withdrawal. 2026-05-16T23:31:21Z Accepted peer-reviewed workshop paper presented at the 3rd Workshop on Designerly HRI, HRI 2026 Hector Michael Fried Robin Hill http://arxiv.org/abs/2605.17187v1 PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media 2026-05-16T22:52:11Z Social media are shifting towards pluralism -- community-governed platforms where groups define their own norms. What violates rules in one community may be perfectly acceptable in another. Can AI models help moderate such pluralistic communities? We formalize the task as a multiple-choice problem, mirroring how human moderators operate in the real world: given a comment and its surrounding context, identify which specific rule, if any, is violated. 
We introduce PluRule, a multimodal, multilingual benchmark for detecting 13,371 rule violations across 1,989 Reddit communities spanning 2,885 rules in 9 languages. Using this benchmark, we show that state-of-the-art vision-language models struggle significantly: even GPT-5.2 with high reasoning performs only slightly better than a trivial baseline. We also find that bigger models and increased context provide marginal gains, and universal rules like civility and self-promotion are easier to detect. Our results show that moderation of pluralistic communities on social media is a fundamental challenge for language models. Our code and benchmark are publicly available. 2026-05-16T22:52:11Z Accepted to ACL 2026 Main Conference Zoher Kachwala Bao Tran Truong Rasika Muralidharan Haewoon Kwak Jisun An Filippo Menczer http://arxiv.org/abs/2510.16046v3 CARDIO-Affect: A Hamiltonian-Variability Framework for Spatio-Temporal Emotional Pattern Recognition with Manifold-Based Individual and Group Profiling 2026-05-16T22:51:17Z We present CARDIO-Affect, a complex-systems theoretical framework for long-term emotional dynamics in bounded social groups, with explicit uncertainty quantification at every layer. Long-period naturalistic emotion in stable small groups exhibits hallmarks of complex systems -- multi-stable attractors, weak chaos, long-range memory, and sparse heterogeneous coupling -- invisible to conventional short-clip facial-emotion analysis. 
CARDIO-Affect treats individual emotion as a multi-stable nonlinear stochastic dynamical system and group emotion as a sparsely-coupled network with emergent macrostates, formalised through six propositions and four pillars: (i) statistical mechanics with neural-parameterised Hamiltonian SDE over asymmetric potentials; (ii) information geometry on a 45-dimensional Fisher-Rao manifold; (iii) topological data analysis for invariant trajectory signatures; (iv) HRV-inspired Emotional Variability Analytics (EVA) decomposing each person-day into multi-scale time/frequency/nonlinear measures. We validate on the first 30.1-month longitudinal in-the-wild facial-emotion corpus (companion: arXiv:2510.15221) by discovering three falsifiable paradoxes: Sparse-Contagion (R_0=0.36, density 2.7%, 8 BH-FDR edges), Asymmetric-Persistence (negative dwell 5.85x positive, 1.77D potential gap), and Crisis-Inversion (Shanghai 2022 lockdown naive d=-0.40 collapses to permutation-p=0.94 under BSTS + synthetic-control). On synthetic benchmarks, CARDIO-EBM v2 matches asymptotically optimal Granger on linear VAR data (Class A AUROC 0.984+/-0.012 vs Granger 0.997+/-0.001, 5 seeds) but fails on tanh-coupled nonlinear data (Class B AUROC 0.490 vs Granger 0.796), a documented limitation of the linear mask-self estimator. We release framework code and the full reproduction pipeline. 2025-10-16T16:19:57Z v2: Major revision; supersedes v1 ('Neuroticism Paradox', 2025) after FDR-aware re-validation. New complex-systems framework, 6 propositions, three falsifiable paradoxes, Class A AUROC 0.984+/-0.012 matching Granger. Companion: arXiv:2510.15221 (WELD). 23 pages. 
Submitted to IEEE TPAMI Xiao Sun http://arxiv.org/abs/2510.15221v2 WELD: The First Naturalistic Long-Period Small-Team Workplace Emotion Dataset for Ubiquitous Affective Computing 2026-05-16T22:38:44Z Affective computing has matured rapidly in laboratory settings, yet no prior dataset combines (i) months-to-years of duration, (ii) a naturalistic workplace context, (iii) a stable small-team social structure, and (iv) a fully passive sensing protocol that survives institutional review. We introduce WELD, the first dataset to satisfy all four. WELD comprises 733,780 per-frame seven-class facial-expression probability vectors from 49 employees of a Chinese software company over 30.1 months (Nov 2021 - May 2024) -- the longest naturalistic in-the-wild emotion corpus and the only multi-year corpus supporting both within-individual longitudinal and within-team relational analyses on the same subjects. Data are released under a four-tier access model with only aggregated probabilities publicly downloadable. We validate the corpus by replicating three established phenomena (+43.1% weekend valence boost; 13:00-trough diurnal cycle; Shanghai 2022 lockdown effect d=-0.40), and report four novel findings: (1) variance decomposition attributes 19.3% of daily-valence variance to between-person differences and 29.8% to month seasonality -- a quantitative ceiling for future predictive models; (2) Hidden Markov decomposition reveals six emotional regimes with asymmetric negative-state dwell times (16-18 d vs 3 d); (3) leave-one-person-out turnover prediction reaches AUC=0.79 yet a Cox concordance index of only 0.52, exposing a metric-trap when AUC is reported without survival-aware baselines; (4) the corpus reveals systematic over-prediction of "angry" by an off-the-shelf FER model on neutral Asian faces (0.194 vs ~0.05 Western priors), making WELD valuable for FER fairness audits. A complex-systems analysis of the corpus appears as a companion preprint (arXiv:2510.16046). 
2025-10-17T00:59:43Z v2: Major revision. 30-month report with full ethics framework, 4-tier access model, variance decomposition, HMM regime discovery, AUC=0.79 vs C-index=0.52 turnover-prediction methodology audit, and Asian-neutral-face FER bias finding. Companion: arXiv:2510.16046. 49 employees, 733,780 records, 17 pages. Submitted to IEEE TAFFC Xiao Sun http://arxiv.org/abs/2606.12438v1 From Real-World Projects to Research-Oriented Learning: Continuous Improvement of a Master-Level Course in Software Engineering Education 2026-05-16T19:17:54Z Problem: Despite growing interest in project-based learning, little is known about how a master-level course can be continuously evolved toward research-oriented approaches over several years and how students perceive this development. Method: We conducted a longitudinal mixed-methods study of a master-level course in Information Systems at the University of Applied Sciences and Arts Hannover (Germany). The analysis covers six years between 2019 and 2025 and draws on teaching evaluations, course documentation, and reflective teaching artifacts. Results: The course evolved from a practice-oriented project format toward a more explicitly research-oriented learning environment. Despite this change, students' perceived course quality remained positive. Authentic projects, external collaboration, lecturer support, structured scaffolding, and visible relevance supported positive student perceptions. Contribution: This paper shows how a master-level course can be continuously evolved toward research-oriented learning while maintaining positive student perceptions. It further identifies the course design decisions that supported this transition. 
2026-05-16T19:17:54Z Michael Neumann Eva-Maria Schön http://arxiv.org/abs/2506.23978v3 LLM Agents Are the Antidote to Walled Gardens 2026-05-16T17:29:09Z While the Internet's core infrastructure was designed to be open and universal, today's application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks, technical debt, and legal frictions. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security. 2025-06-30T15:45:17Z Published at the ICML 2026 Position Paper track Samuele Marro Philip Torr http://arxiv.org/abs/2605.17086v1 Global Automation Atlas 2026-05-16T17:01:59Z Automation affects the labour content of work differently across different contexts. Yet, most existing exposure measures assign fixed scores to tasks or occupations, limiting comparisons of automation exposure across countries. We develop a task-based and country-specific approach to classify automation exposure across the world to disentangle labor-substituting from labor-augmenting automation, the relevant technology channel, and the material role of AI. 
Our measure spans 124 countries, generating an atlas of 2.33 million task-country labels for economies covering 99% of world population and GDP. We present five descriptive results. First, exposure is highly uneven, ranging from 3.3% of tasks in South Sudan to 61.6% in China, and rises strongly with income, although substantial variation remains within income groups. Second, across countries, exposed tasks are skewed towards substitution rather than augmentation, but low-income countries are disproportionately exposed to substitution, whereas middle-income countries are more heterogeneous. Third, less technologically advanced forms of automation account for more than half of exposed tasks in low-income countries but about one quarter in high-income countries, while other more complex channels generally rise with income levels. Fourth, AI tends to be less prevalent in simpler channels of automation, but also more prevalent in labour-substituting margins in lower income settings and to augment labour in higher income settings. Fifth, we find that females seem to be disproportionately more exposed to labour-substituting automation than males. Our methodology provides a basis for comparing automation exposure across development stages, linking it with cross-country data and allowing us to treat exposure levels, labour margins, technological channels and AI involvement as separate dimensions. 2026-05-16T17:01:59Z 65 pages, 6 figures. Data and code: https://automationatlas.org/ Prashant Garg Tommaso Crosta Jasmin Baier http://arxiv.org/abs/2605.17079v1 Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench 2026-05-16T16:55:31Z LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. 
We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate--reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse. 2026-05-16T16:55:31Z Tianyu Wang Jiajun Li Jianghao Lin http://arxiv.org/abs/2605.17055v1 Generative AI Feedback, English Writing and Teacher Rubrics: A Multiple-Case Study of CyberScholar 2026-05-16T15:55:39Z This multiple-case study examined the potential of a Generative AI (GenAI) tool, CyberScholar, to support K-12 students' writing across disciplines. This tool integrates teacher-provided rubrics, materials, and exemplars through Retrieval-Augmented Generation (RAG), producing criterion-specific formative feedback and ratings. The study involved 143 students and five teachers in grades 7 through 11 across five U.S. middle and high schools. 
Data sources included classroom observations, student post-surveys (n = 79), student focus group interviews (n = 18), and teacher surveys (n = 5). Qualitative analysis followed two cycles of coding to identify patterns within and across cases. Findings indicate that students valued CyberScholar's immediate, rubric-based feedback and noticed improvements in their writing as they revised, using it to refine organization, elaboration, and style. They also highlighted the tool's interactive, iterative qualities, which fostered revision and reduced reliance on teacher feedback. However, participants noted inconsistencies in the automated rating system and occasional misalignment with assignment expectations. Teachers reported that CyberScholar saved time on feedback and supported more targeted, higher-order instructional practices. The study underscores the promise of rubric-grounded GenAI formative feedback for developing writing skills, while emphasizing the need for human oversight, calibration of automated ratings, and attention to contextual factors shaping adoption. 2026-05-16T15:55:39Z 31 pages, 2 figures, 2 tables Raigul Zheldibayeva Ana Karina de Oliveira Nascimento Vania Castro Bill Cope Mary Kalantzis http://arxiv.org/abs/2605.17031v1 A Joint Synthetic Housing-Household Inventory 2026-05-16T14:58:44Z Accurately understanding the interactions between humans and the built environment requires integrated representations of both the buildings and the populations that occupy them. However, high-fidelity datasets that jointly capture detailed housing structures and demographic characteristics at the household level do not currently exist. This paper presents a framework for constructing a joint housing-household inventory that explicitly links individuals and households to compatible housing units from the National Structure Inventory (NSI), while preserving realistic population densities and demographic distributions. 
The framework integrates three components: (i) synthetic population generation from American Community Survey (ACS) Public Use Microdata Sample (PUMS) records that preserve complex intra-household relationships; (ii) a deep contrastive learning model that quantifies housing-household compatibility; and (iii) a hierarchical optimization-based allocation procedure that enforces building-level capacity and block-group-level demographic constraints. The generated synthetic population attains high statistical realism relative to the census microdata, and the contrastive learning model identifies compatible housing-household pairs with high predictive accuracy. Applied to coastal North Carolina, evaluations at building, neighborhood, and regional scales show that the joint inventory matches block-group-level demographic distributions, reproduces observed spatial population patterns without systematic bias, and maintains consistent allocation quality across urban, suburban, and rural contexts. By enabling coupled household- and building-level analyses, the resulting inventory supports a broad range of applications, including disaster resilience planning, housing and affordability analysis, energy-use assessment, and public health research. 2026-05-16T14:58:44Z Xiao Qian Shangjia Dong Rachel Davidson http://arxiv.org/abs/2605.17010v1 Algorithmic Cultivation: How Social Media Feeds Shape User Language 2026-05-16T14:14:43Z Algorithmic feeds have become primary environments for encountering information online, yet while they shape what people see, less is known about how sustained feed exposure shapes how people write. Drawing on Cultivation Theory, we examine whether algorithmic feeds function as online environments that leave measurable traces in users' language. 
We leverage a large-scale longitudinal dataset of 235M posts by 4M users on Bluesky, and conduct a quasi-experimental study matching an initial pool of 368,513 users exposed to one of three feeds -- News, Science, and Blacksky -- with a pool of 2,001,915 active control users who did not engage with any of these feeds. We examine linguistic evolution across three dimensions: lexico-semantics, psycholinguistics, and topics. We find that users exposed to these feeds show significantly greater stylistic accommodation, semantic alignment, and register formalization than matched controls. These effects vary markedly by feed identity -- Blacksky produces the deepest psycholinguistic restructuring, with significant shifts in cognitive processing, affective expression, and pronoun use, while News and Science effects are largely confined to register and topical focus. Regression models reveal that reposting is the most consistent predictor of linguistic convergence across all feeds, whereas posting and bookmarking show feed-dependent effects, with effects differing more than fourfold across feeds. Our work extends Cultivation Theory beyond belief formation to linguistic behavior, demonstrating that feeds function as persistent linguistic environments that gradually shape what and how users write online. Our work has implications for studying algorithmic influence, online identity formation, and the design and governance of feed-based platforms that mediate online interactions. 2026-05-16T14:14:43Z Olivia Pal Agam Goyal Eshwar Chandrasekharan Koustuv Saha