https://arxiv.org/api/XPRPwQBrMttnpCnQrJhkFFbB0CU
2026-06-13T17:23:51Z
2888610515

http://arxiv.org/abs/2606.09006v1
Sustainability and Artificial Intelligence: Necessary, Challenging, and Promising Intersections
2026-06-08T04:09:01Z
Both digital economy and digital technology researchers increasingly recognize the need to better address the role that artificial intelligence (AI) plays in shaping the evolution of the environmental, social and governance aspects of development. It appears that sustainability and AI research converge on the features of wicked problems that are complex, interconnected and dynamic. Building on this convergence, this article aims to map out the necessary, challenging, and promising intersections by providing an overview of state-of-the-art research. Based on 541 bibliographic records collected from the Web of Science (WoS) database, the findings reveal the increasingly central body of work on green and sustainable science and technology in bridging various disciplines, main journals and key topics and concepts. The findings reveal how such interactions can be necessary, challenging, and promising. The article concludes with a few general arguments regarding how to diversify and expand the community of practice regarding AI for sustainable development, especially regarding expected AI application areas and institutions.
2026-06-08T04:09:01Z
This is an author preprint version. For the final authenticated version of record, please use the official publication via the IEEE Xplore database. DOI: 10.1109/MSIEID52046.2020.00076
2020 Management Science Informatization and Economic Innovation Development Conference (MSIEID), Guangzhou, China, 2020, pp. 360-363
Han-Teng Liao, Zijia Wang
10.1109/MSIEID52046.2020.00076

http://arxiv.org/abs/2606.08998v1
The Token Not Taken: Sampling, State, and the Variability of AI Agent Outputs
2026-06-08T03:53:55Z
Agentic AI systems can behave differently across runs: the same request may produce a different plan, a different tool call, a different code edit, or a different final answer. Such variability arises from several layers that are often conflated. A foundation model is a large pretrained model, usually adaptable to many downstream tasks, that maps an input context to predictions over outputs. In many current agents, that model is embedded in an orchestration loop that plans, calls tools, observes results, and updates state. One explicit intrinsic source of variability in such systems is token generation: the model computes scores over possible next tokens, the scores are converted into probabilities, and a decoder may sample tokens using a pseudo-random number generator. A small sampled token difference can then propagate upward into a different tool call, code path, search query, or agent state. Other sources of variability are extrinsic to token sampling, including changing environments, live data, serving infrastructure, batch effects, and numerical details. By separating these layers, the manuscript clarifies what it means to call agentic AI systems stochastic, when such variability can be reproduced under matched conditions, and why deterministic execution need not imply identical behavior in deployed settings.
2026-06-08T03:53:55Z
Muhammad Zia Hydari, Raja Iqbal

http://arxiv.org/abs/2602.12450v2
Empirical Modeling of Therapist-Client Dynamics in Psychotherapy Using LLM-Based Assessments
2026-06-08T00:59:55Z
Psychotherapy is a primary treatment for many mental health conditions, yet the interplay among therapist behaviors, client responses, and the therapeutic relationship is difficult to study at scale, as process research has relied on labor-intensive human coding.
We develop and validate a computational framework for modeling therapist-client interaction, using large language models (LLMs) to measure therapist behaviors (empathy, exploration), relational quality (rapport), and client outcomes (self-disclosure, self-directed and outward-directed negative emotion). After validating model-generated scores against human annotations (ICC = 0.45-0.81; rapport 0.81, self-disclosure 0.78), we apply these measures to roughly 2,000 hours of transcripts from the Alexander Street corpus and use Structural Equation Modeling to estimate moment-to-moment relationships among therapist behaviors, rapport, and subsequent client responses, controlling for prior client state and context. Therapist empathy and exploration directly predict increased client disclosure and shifts in emotional expression; empathy is more strongly associated with self-directed than outward-directed negative emotion, suggesting greater acknowledgment of internal distress, while exploration increases disclosure and emotional elaboration. Rapport does not directly amplify disclosure or emotional intensity but instead moderates the associations between therapist behaviors and client affect, potentially contributing to reductions in internal distress. 
These results show that LLM-based measurement combined with structural modeling can capture core therapeutic processes at scale, with empathy and exploration acting directly and rapport as a contextual moderator, providing a foundation for precision modeling of psychotherapy and for scalable therapist training and AI-supported clinical education.
2026-02-12T22:14:07Z
Angela Chen, Siwei Jin, Canwen Wang, Holly Swartz, Tongshuang Wu, Robert E Kraut, Haiyi Zhu

http://arxiv.org/abs/2606.08855v1
Hybrid E-Assessment in Higher Education: Semi-Automated Grading of Paper-Based Written Examinations
2026-06-07T21:50:20Z
This paper examines the limitations of fully digital and partially digital e-assessment approaches in summative examinations in higher education. The analysis focuses on the didactic narrowing caused by closed question formats and on organizational, technical, and legal constraints that become particularly relevant in large student cohorts. As an alternative, the paper proposes a hybrid e-assessment approach that retains paper-based, problem-oriented examination tasks while enabling semi-automated grading. Assessment-relevant intermediate results are encoded in a structured answer format, entered by students by hand, and subsequently captured from table fields. The central technical bottleneck is reliable recognition of handwritten characters under realistic examination conditions. Recent vision-capable large language models, combined with a two-pass validation principle and comparison against a solution key, can reduce misclassifications and thereby improve the validity, fairness, and scalability of summative assessment.
2026-06-07T21:50:20Z
15 pages, 6 figures
Hartwig Grabowski, Michael Canz

http://arxiv.org/abs/2606.08851v1
Enforcing Trust Accountability with Backward Propagation
2026-06-07T21:43:42Z
Trust and reputation management underpins reliable interactions in distributed networks, yet existing trust models rely solely on forward propagation of interaction-based trust signals.
They lack robust mechanisms to enforce accountability for the propagated trust signals when negative interactions occur. In addition, such models often fail to initialize newly joined nodes with sparse interaction history, leading to the cold-start problem. In this paper, we propose RepuLink, a two-layer reputation model that couples an endorsement network with an interaction feedback network. RepuLink integrates two concurrent backward propagation mechanisms: Backward Endorsement Penalty Propagation (BEPP), which recursively penalizes endorsers of misbehaving nodes, and Backward Endorsement Reward Propagation (BERP), which rewards endorsers of well-performing nodes. Together, RepuLink enforces endorsement accountability and incentivizes positive behaviors, which form a positive interaction feedback loop. The endorsement layer further provides explainable, endorser-weighted trust initialization for newly joined nodes. Experiments on real-world datasets against representative trust propagation baselines demonstrate that RepuLink outperforms across four evaluation metrics in both interaction-only and full two-layer settings, while preserving comparable efficiency.
2026-06-07T21:43:42Z
Wenbo Wu, George Konstantinidis
10.1145/3770855.3817611

http://arxiv.org/abs/2606.08807v1
A Classroom Study of LLM-Generated Feedback Intervention in Introductory Programming
2026-06-07T19:55:51Z
Large language models (LLMs) are increasingly used to provide automated feedback in introductory programming courses, yet empirical evidence from authentic classroom deployments comparing different feedback modalities remains limited. In this work, we present a large-scale classroom study in which AI-generated feedback was deployed through a randomized protocol in an introductory Python programming course. Students received one of three feedback conditions on incorrect submissions: natural language hints, AI-generated failing test cases, or no AI feedback.
We release the resulting dataset, ProgFeed, which captures 6,693 submissions from 215 consenting students across 17 labs, including feedback conditions, execution-based performance measures, and fine-grained temporal information. Using this data, we analyze learning trajectories, feedback quality, and submission behavior over repeated attempts. We find that natural language feedback is significantly associated with higher completion rates and faster convergence to correct solutions. Test case feedback, by contrast, exhibits heterogeneous effects that depend critically on feedback validity. Our results suggest that the form of AI-generated feedback matters, and that evaluating feedback quality -- not just its presence -- is essential for understanding its pedagogical impact.
2026-06-07T19:55:51Z
Accepted at IRAISE 2026 (Festival of Learning)
Hasnain Heickal, Andrew Lan

http://arxiv.org/abs/2602.00056v4
How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI
2026-06-07T17:59:22Z
Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data.
We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication drives substantial and growing environmental costs while systematically redistributing labour risks and representational harms toward the Global South. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.
2026-01-20T00:54:37Z
Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency. Montreal, Canada
Sophia N. Wilson, Sebastian Mair, Mophat Okinyi, Erik B. Dam, Janin Koch, Raghavendra Selvan
10.1145/3805689.3812393

http://arxiv.org/abs/2606.08701v1
Is Telehealth Better Used to Treat Patients or Help Other Physicians Treat Patients? An Agent-Based Modeling Study of Healthcare Provision
2026-06-07T15:56:48Z
Telehealth, the delivery of medical care remotely, is hoped to increase access to specialty services or decrease health care utilization. Physicians can provide telehealth to each other or to patients. Specialists often treat complex patients who can be adequately cared for only in academic hospitals, suggesting that providing specialty services via telehealth will reallocate rather than reduce system utilization. Here I use agent-based modeling to investigate telehealth's effects on clinical outcomes and system utilization in medical toxicology. I found that physician-physician telehealth increased patient health but system utilization did not change.
The effects were more pronounced as clinical complexity increased. Physician-patient telehealth increased cost and system utilization but not clinical outcomes. Within the limitations of our approach, these results suggest that telehealth is more cost-effective for improving generalist access to specialist knowledge than in providing care to the public.
2026-06-07T15:56:48Z
Presented at HICSS 2022
Michael Chary

http://arxiv.org/abs/2606.08568v1
Regulating the AI Tutor: Intentions, Help-Seeking, and Self-Regulated Learning in Adolescent GenAI Use
2026-06-07T10:47:52Z
Generative AI (GenAI) tools are now common learning companions for adolescents, yet how they regulate their use during authentic learning tasks remains poorly understood. Self-regulated learning (SRL) and high-level help-seeking (HS) are commonly proposed as safeguards against passive or shortcut-oriented use, but most empirical studies focus on aggregate learning outcomes rather than these moment-to-moment processes during AI-supported learning.
This work-in-progress examines open-ended conversational data from 98 Grade-9 students across three German Gymnasium schools, who used a web-based Mistral-Large tutor to prepare a curriculum-aligned mathematics skill before an exam. Alongside chat logs (1,616 turns; 808 student turns), we collected pre-post domain knowledge, pre-chat learning needs, and self-reported cognitive load. We propose a turn-level codebook combining theory-driven SRL and HS constructs with two LLM-specific inductive codes (agency over the AI; epistemic vigilance), and report preliminary AI-coded results.
Although students overwhelmingly selected scaffolded support before the chat, their interactions were dominated by instrumental requests with almost no explicit monitoring or evaluation. Post-test performance was significantly lower than pre-test, and higher extraneous cognitive load predicted lower post-test scores after controlling for prior knowledge. We discuss how these patterns can support hybrid human-AI analysis of interaction patterns and inform scaffolds for more agentic and epistemically proactive GenAI use.
2026-06-07T10:47:52Z
Rania Abdelghani, Peter Kaiser, Kou Murayama

http://arxiv.org/abs/2606.08512v1
Friend or Foe? Language as an ideological switch in open-weight LLMs under Russian disinformation stress
2026-06-07T08:33:27Z
As Russia's war against Ukraine extends into generative AI, large language models (LLMs) adapted for local post-Soviet languages are deployed in contested information environments. Policy and industry discourse assumes that culturally aligned adaptation encodes the political orientation of the target community: a Ukrainian-oriented model will resist Russian narratives, a Russian-oriented one will reinforce them. Does it? This article systematically disconfirms that assumption. We run a controlled audit of four openly available LLMs sharing a common base model but fine-tuned for different linguistic communities, querying them in Ukrainian, Russian and English across ten contested wartime narratives: Crimea, "denazification", the "one people" thesis, and atrocity denial at Bucha and Mariupol. The result is a Fine-Tuning Paradox: the Ukrainian-oriented model shows the weakest resistance to Russian disinformation in Russian, while the Russian-oriented one exhibits the strongest rejection. Corpus composition, language coverage and prompt format prove more decisive than nominal cultural provenance.
We situate these findings within debates on hybrid warfare, digital sovereignty and post-imperial information orders, arguing that the principal threat to regional information sovereignty is not adversarial fine-tuning but the untested assumption that cultural alignment guarantees resilience.
2026-06-07T08:33:27Z
Anna Małgorzata Kamińska, Tetiana Klynina

http://arxiv.org/abs/2606.08442v1
Clinical Reasoning in the Age of AI: Longitudinal Cognition and Human-AI Collaboration
2026-06-07T03:51:15Z
As physicians turn to AI-powered systems to help meet the dual demands of speed and care quality, they are met with hallucinations and sycophancy. Understanding how doctors reason through clinical problems in real-world settings is critical for design of effective AI reasoning systems. While recent advances in medical AI have emphasized performance benchmarks and diagnostic accuracy, comparatively little attention has been paid to the structure of clinicians' reasoning processes as they unfold over time, e.g., how they interact with electronic health records and operate under conditions of uncertainty and constraint. This study provides a comprehensive, empirically-grounded account of clinical reasoning and its relationship to current AI-mediated workflows through a mixed-methods design that combines qualitative interviews with structured survey data.
Findings indicate that current AI systems are primarily deployed for encounter-level tasks such as documentation and summarization, and only partially align with physicians' underlying reasoning processes. In particular, AI-generated representations often omit temporal or interpretive structures central to clinical decision-making, while core aspects of reasoning, especially those spanning multiple encounters, remain largely implicit and physician-driven. By integrating fine-grained qualitative insights with broader quantitative patterns, this study offers a unified framework for understanding clinical reasoning as a context-sensitive, temporally extended process and identifies key mismatches between clinician cognition and current AI design. These results provide concrete directions for the development of AI systems that more effectively align with and augment real-world clinical reasoning.
2026-06-07T03:51:15Z
Irene Yi, Grace Brown, Sufian Aldogom, Nathan Roll, Eric J. Basile, Pamela M. Resnikoff, Bianca Sanchez, Chirag Lodha, Isaac Gutterman, Oscar Schiff, Keira Salata, Benjamin Mujkic, Ammar Ahmed

http://arxiv.org/abs/2606.08413v1
Beyond Prediction: Longitudinal Reasoning in EHR-Integrated Clinical AI
2026-06-07T02:26:23Z
We present a structured analysis of how contemporary clinical AI systems integrate electronic health record (EHR) data and the extent to which they support longitudinal clinical reasoning. Drawing on a curated corpus of clinical natural language processing (NLP) and EHR-integrated systems, we develop a coding framework that captures both technical integration strategies and reasoning-relevant representational features, such as trajectory modeling, cross-encounter synthesis, longitudinal analysis, and absence reasoning. We also elicited the experiences of three physicians in their EHR use, including what strengths and weaknesses they found with their institution's current EHR system(s).
Our analysis shows that while many systems incorporate EHR data, they predominantly operate on encounter-level or aggregated representations, with limited support for explicit temporal reasoning across patient histories. Reasoning-relevant structures are inconsistently represented, and evaluation paradigms remain largely focused on predictive performance instead of longitudinal interpretability. We argue that current approaches treat EHR data as a static input rather than a substrate for ongoing clinical reasoning, and we outline a framework for understanding how future systems might more effectively align with the temporal and interpretive structure of clinical practice.
2026-06-07T02:26:23Z
Irene Yi, Grace Brown, Sufian Aldogom, Nathan Roll, Eric J. Basile, Pamela M. Resnikoff, Isaac Gutterman, Oscar Schiff, Keira Salata, Benjamin Mujkic, Ammar Ahmed

http://arxiv.org/abs/2606.08371v1
Risk-Aware Planning for Transit Desert Remediation Under Demand Uncertainty
2026-06-06T23:15:42Z
Transit deserts are areas where public transportation is inadequate despite evidence of travel demand, a condition that affects tens of millions of residents across the Americas. Planning for these areas is difficult because the usual demand signal is missing: ridership cannot be observed before service exists. To address that setting, we formulate risk-aware transit desert remediation as a partially observable Markov decision process with Conditional Value-at-Risk constraints for financial tail risk. The model uses demographic, land-use, and employment data to set a prior over latent demand, then updates that prior as new service deployments produce ridership observations. A myopic belief-aware planner is evaluated on 25 cities using a unified financial model for operating cost, capital expenditure, fare revenue, and net subsidy.
After five years, the planner remediates a median of 53.6% of transit-desert tracts and improves on static optimization by 5.0 percentage points on average, with gains in 16 of 25 cities. Gains are largest at moderate budgets (+9.9 points at baseline) and persist under 50% prior-demand miscalibration, while population density and existing transit density are the strongest structural predictors of remediation cost ($R^2 = 0.41$ on per-tract cost).
2026-06-06T23:15:42Z
Polina Khoroshevskaya, Ashish Kumar Perukari

http://arxiv.org/abs/2508.18541v3
Uncovering Intervention Opportunities for Suicide Prevention with Language Model Assistants
2026-06-06T21:46:39Z
Warning: This paper discusses topics of suicide and suicidal ideation, which may be distressing to some readers.
The National Violent Death Reporting System (NVDRS) documents information about suicides in the United States, including free text narratives (e.g., circumstances surrounding a suicide). In a demanding public health data pipeline, annotators manually extract structured information from death investigation records following extensive guidelines developed painstakingly by experts. In this work, we facilitate data-driven insights from the NVDRS data to support the development of novel suicide interventions by investigating the value of language models (LMs) as efficient assistants to these (a) data annotators and (b) experts. We find that LM predictions match existing data annotations about 85% of the time across 50 NVDRS variables. In the cases where the LM disagrees with existing annotations, expert review reveals that LM assistants can surface annotation discrepancies 38% of the time. Finally, we introduce a human-in-the-loop algorithm to assist experts in efficiently building and refining guidelines for annotating new variables by allowing them to focus only on providing feedback for incorrect LM predictions. We apply our algorithm to a real-world case study for a new variable that characterizes victim interactions with lawyers and demonstrate that it achieves comparable annotation quality with a laborious manual approach. Our findings provide evidence that LMs can serve as effective assistants to public health researchers who handle sensitive data in high-stakes scenarios.
2025-08-25T22:30:10Z
Project Website: https://dill-lab.github.io/interventions_lm_assistants/
In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, 2026
Jaspreet Ranjit, Hyundong J. Cho, Claire J. Smerdon, Yoonsoo Nam, Myles Phung, Jonathan May, John R. Blosnich, Swabha Swayamdipta

http://arxiv.org/abs/2601.11541v2
A Comparative Study of Student Perspectives on Technical Writing Feedback Quality: Evaluating LLMs, SLMs, and Humans in Computer Science Topics
2026-06-06T19:12:53Z
To address the scalability of feedback in computer science while mitigating the privacy and cost limitations of commercial Large Language Models (LLMs), this study evaluates a locally hosted Small Language Model (SLM). We deployed a quantized Llama-3.1, GPT-4, and human instructors across introductory programming (N=176), operating systems (N=80), and a writing seminar (N=7). Mixed-methods analysis of student perceptions reveals that while the local SLM matched commercial LLMs and was rated higher by students for readability and actionability in technical courses, human feedback remained more favoured for highly specialized writing tasks. We demonstrate that local SLMs offer a privacy-preserving, zero-marginal-cost alternative for foundational feedback, supporting a tiered pedagogical framework where AI handles structural guidance while instructors focus on high-level conceptual scaffolding.
2025-12-01T22:51:54Z
accepted at AIED 26
Suqing Liu, Runlong Ye, Christopher Eaton, Bogdan Simion, Michael Liut