https://arxiv.org/api/cpF51MBK/tJgdZplG4WIxqN5F0w
Updated: 2026-03-28T09:04:20Z

http://arxiv.org/abs/2603.24302v1
A Large-Scale Study of Telegram Bots
Updated: 2026-03-25T13:41:47Z
Telegram, initially a messaging app, has evolved into a platform where users can interact with various services through programmable applications, bots. Bots provide a wide range of uses, from moderating groups, helping with online shopping, to even executing trades in financial markets. However, Telegram has been increasingly associated with various illicit activities -- financial scams, stolen data, non-consensual image sharing, among others, raising concerns that bots may be facilitating these operations. This paper is the first to characterize Telegram bots at scale, through the following contributions. First, we offer the largest general-purpose message dataset and the first bot dataset. Through snowball sampling from two published datasets, we uncover over 67,000 additional channels, 492 million messages, and 32,000 bots. Second, we develop a system to automatically interact with bots in order to extract their functionality. Third, based on their description, chat responses, and the associated channels, we classify bots into several domains. Fourth, we investigate the communities each bot serves, by analyzing supported languages, usage patterns (e.g., duration, reuse), and network topology. While our analysis discovers useful applications such as crowdsourcing, we also identify malicious bots (e.g., used for financial scams, illicit underground services) serving as payment gateways, referral systems, and malicious AI endpoints.
By exhorting the research community to look at bots as software infrastructure, this work hopes to foster further research useful to content moderators, and to help interventions against illicit activities.
Published: 2026-03-25T13:41:47Z
Note: Proceedings of the 20th International AAAI Conference on Web and Social Media (ICWSM 2026)
Authors: Taro Tsuchiya, Haoxiang Yu, Tina Marjanov, Alice Hutchings, Nicolas Christin, Alejandro Cuevas

http://arxiv.org/abs/2603.24197v1
The First Generation of AI-Assisted Programming Learners: Gendered Patterns in Critical Thinking and AI Ethics of German Secondary School Students
Updated: 2026-03-25T11:20:28Z
The first generation of students is learning to program alongside GenAI (Generative Artificial Intelligence) tools, raising questions about how young learners critically engage with them and perceive ethical responsibilities. While prior research has focused on university students or developers, little is known about secondary school novices, who represent the next cohort of software engineers. To address this gap, we conducted an exploratory study with 84 German secondary school students aged 16-19 attending software development workshops. We examined their critical thinking practices in AI-assisted programming, perceptions of AI ethics and responsibility, and gender-related differences in their views. Our results reveal an AI paradox: students demonstrate strong ethical reasoning and awareness about AI, yet many report integrating AI-generated code without a thorough understanding of it. The majority of our cohort attributed significant responsibility for AI practices to politics and corporations, potentially reflecting Germany's cultural context, with its strict regulations and data privacy discourse. Boys reported more frequent and experimental use of AI-assisted programming, whereas girls expressed greater scepticism and emphasised peer collaboration over GenAI assistance.
Our findings highlight the importance of culturally responsive software engineering education that strengthens critical AI literacy in AI-assisted programming by linking ethics to concrete code artefacts and preparing young learners for this AI-driven software landscape.
Published: 2026-03-25T11:20:28Z
Authors: Isabella Graßl

http://arxiv.org/abs/2603.24191v1
Integrating Mental Health, Well-Being, and Sustainability into Software Engineering Education
Updated: 2026-03-25T11:14:52Z
Mental health and well-being are major concerns in higher education and professional fields such as software engineering, yet are often overlooked in curricula. This paper describes our approach to include mental health, well-being, and sustainability in software engineering education in two ways: (1) well-being-focused software projects that ask students to design technical solutions or research addressing mental health and sustainability or societal challenges, and (2) brief classroom interventions such as short reflective discussions and team-building activities. We argue that this combination can help students see software engineering more broadly while creating healthier learning environments. Our analysis of reflections from 60 students found several positive outcomes: students gained a more human-centred perspective, had more team discussions about mental health, and began to see well-being as inspiration for using software to benefit society and individuals rather than merely as a technical or business tool.
By combining technical skills with awareness of well-being, we argue that software engineering education can prepare future developers to be both skilled programmers and responsible professionals who care about human well-being.
Published: 2026-03-25T11:14:52Z
Authors: Isabella Graßl, Birgit Penzenstadler

http://arxiv.org/abs/2603.23990v1
From Untamed Black Box to Interpretable Pedagogical Orchestration: The Ensemble of Specialized LLMs Architecture for Adaptive Tutoring
Updated: 2026-03-25T06:38:19Z
Monolithic Large Language Models (LLMs) used in educational dialogue often behave as "black boxes," where pedagogical decisions are implicit and difficult to audit, frequently violating instructional constraints by providing answers too early. We introduce the Ensemble of Specialized LLMs (ES-LLMs) architecture that separates decision-making from wording. Pedagogical actions are selected by a deterministic rules-based orchestrator coordinating specialized agents covering tutoring, assessment, feedback, scaffolding, motivation, and ethics, guided by an interpretable Bayesian Knowledge Tracing (BKT) student model. An LLM renderer surface-realizes the chosen action in natural language. This design emphasizes reliability and controllability: constraints such as "attempt-before-hint" and hint caps are enforced as explicit rules, and the system logs per-turn agent traces and constraint checks. Validation of pedagogical quality via human expert reviewers (N=6) and a multi-LLM-as-Judge panel (six state-of-the-art models) showed that ES-LLMs were preferred in 91.7% and 79.2% of cases, respectively. The architecture significantly outperformed monolithic baselines across all seven dimensions, particularly in Scaffolding & Guidance, and Trust & Explainability. Furthermore, a Monte Carlo simulation (N=2,400) exposed a "Mastery Gain Paradox," where monolithic tutors inflated short-term performance through over-assistance.
In contrast, ES-LLMs achieved 100% adherence to pedagogical constraints (e.g., attempt-before-hint) and a 3.3x increase in hint efficiency. Operationally, ES-LLMs reduced costs by 54% and latency by 22% by utilizing stateless prompts. We conclude that structural decoupling is essential for transforming stochastic models into trustworthy, verifiable and resource-efficient pedagogical agents.
Published: 2026-03-25T06:38:19Z
Note: Accepted as a FULL paper at the 27th International Conference on Artificial Intelligence in Education (AIED 2026). 15 pages, 4 figures, 4 tables
Authors: Nizam Kadir

http://arxiv.org/abs/2603.20957v2
Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
Updated: 2026-03-25T04:16:40Z
Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami's novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors' works reactivates latent memorization from pretraining.
Three models from different providers memorize the same books in the same regions ($r \ge 0.90$), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors' works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
Published: 2026-03-21T21:46:16Z
Note: Preprint Under Review
Authors: Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, Tuhin Chakrabarty

http://arxiv.org/abs/2603.21106v2
Tracing Users' Privacy Concerns Across the Lifecycle of a Romantic AI Companion
Updated: 2026-03-25T03:45:23Z
Romantic AI chatbots have quickly attracted users, but their emotional use raises concerns about privacy and safety. As people turn to these systems for intimacy, comfort, and emotionally significant interaction, they often disclose highly sensitive information. Yet the privacy implications of such disclosure remain poorly understood in platforms shaped by persistence, intimacy, and opaque data practices. In this paper, we examine public Reddit discussions about privacy in romantic AI chatbot ecosystems through a lifecycle lens. Analyzing 2,909 posts from 79 subreddits collected over one year, we identify four recurring patterns: disproportionate entry requirements, intensified sensitivity in intimate use, interpretive uncertainty and perceived surveillance, and irreversibility, persistence, and user burden. We show that privacy in romantic AI is best understood as an evolving socio-technical governance problem spanning access, disclosure, interpretation, retention, and exit.
These findings highlight the need for privacy and safety governance in romantic AI that is staged across the lifecycle of use, supports meaningful reversibility, and accounts for the emotional vulnerability of intimate human-AI interaction.
Published: 2026-03-22T07:49:59Z
Note: 16 pages, 1 figure, in submission at a conference
Authors: Kazi Ababil Azam, Imtiaz Karim, Dipto Das

http://arxiv.org/abs/2603.23863v1
Generative AI User Experience: Developing Human-AI Epistemic Partnership
Updated: 2026-03-25T02:40:13Z
Generative AI (GenAI) has rapidly entered education, yet its user experience is often explained through adoption-oriented constructs such as usefulness, ease of use, and engagement. We argue that these constructs are no longer sufficient because systems such as ChatGPT do not merely support learning tasks but also participate in knowledge construction. Existing theories cannot explain why GenAI frequently produces experiences characterized by negotiated authority, redistributed cognition, and accountability tension. To address this gap, this paper develops the Human-AI Epistemic Partnership Theory (HAEPT), explaining the GenAI user experience as a form of epistemic partnership that features a dynamic negotiation of three interlocking contracts: epistemic, agency, and accountability. We argue that findings on trust, over-reliance, academic integrity, teacher caution, and relational interaction about GenAI can be reinterpreted as tensions within these contracts rather than as isolated issues. Instead of holding a single, stable view of GenAI, users adjust how they relate to it over time through calibration cycles. These repeated interactions account for why trust and skepticism often coexist and for how partnership modes describe recurrent configurations of human-AI collaboration across tasks.
To demonstrate the usefulness of HAEPT, we applied it to analyze the UX of collaborative learning with AI speakers and AI-facilitated scientific argumentation, illustrating different contract configurations.
Published: 2026-03-25T02:40:13Z
Authors: Xiaoming Zhai

http://arxiv.org/abs/2603.23857v1
When AI output tips to bad but nobody notices: Legal implications of AI's mistakes
Updated: 2026-03-25T02:34:47Z
The adoption of generative AI across commercial and legal professions offers dramatic efficiency gains -- yet for law in particular, it introduces a perilous failure mode in which the AI fabricates fictitious case law, statutes, and judicial holdings that appear entirely authentic. Attorneys who unknowingly file such fabrications face professional sanctions, malpractice exposure, and reputational harm, while courts confront a novel threat to the integrity of the adversarial process. This failure mode is commonly dismissed as random 'hallucination', but recent physics-based analysis of the Transformer's core mechanism reveals a deterministic component: the AI's internal state can cross a calculable threshold, causing its output to flip from reliable legal reasoning to authoritative-sounding fabrication. Here we present this science in a legal-industry setting, walking through a simulated brief-drafting scenario. Our analysis suggests that fabrication risk is not an anomalous glitch but a foreseeable consequence of the technology's design, with direct implications for the evolving duty of technological competence. We propose that legal professionals, courts, and regulators replace the outdated 'black box' mental model with verification protocols based on how these systems actually fail.
Published: 2026-03-25T02:34:47Z
Authors: Dylan J. Restrepo, Nicholas J. Restrepo, Frank Y. Huo, Neil F. Johnson

http://arxiv.org/abs/2603.24625v1
SolRugDetector: Investigating Rug Pulls on Solana
Updated: 2026-03-25T02:31:31Z
Solana has experienced rapid growth due to its high performance and low transaction costs, but the extremely low barrier to token issuance has also led to widespread Rug Pulls. Unlike Ethereum-based Rug Pulls that rely on malicious smart contracts, the unified SPL Token program on Solana shifts fraudulent behaviors toward on-chain operations such as market manipulation. However, existing research has not yet conducted a systematic analysis of these specific Rug Pull patterns on Solana. In this paper, we present a comprehensive empirical study of Rug Pulls on Solana. Based on 68 real-world incident reports, we construct and release a manually labeled dataset containing 117 confirmed Rug Pull tokens and characterize the workflow of Rug Pulls on Solana. Building on this analysis, we propose SolRugDetector, a detection system that identifies fraudulent tokens solely using on-chain transaction and state data. Experimental results show that SolRugDetector outperforms existing tools on the labeled dataset. We further conduct a large-scale measurement on 100,063 tokens newly issued in the first half of 2025 and identify 76,469 Rug Pull tokens. After validating the in-the-wild detection results, we release this dataset and analyze the Rug Pull ecosystem on Solana. Our analysis reveals that Rug Pulls on Solana exhibit extremely short lifecycles, strong price-driven dynamics, severe economic losses, and highly organized group behaviors.
These findings provide insights into the Solana Rug Pull landscape and support the development of effective on-chain defense mechanisms.
Published: 2026-03-25T02:31:31Z
Authors: Jiaxin Chen, Ziwei Li, Zigui Jiang, Ruihong He, Yantong Zhou, Jiajing Wu, Zibin Zheng

http://arxiv.org/abs/2603.23848v1
BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents
Updated: 2026-03-25T02:09:35Z
LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved. That's the wrong model. People change their minds, and over extended interactions, phenomena like opinion drift, over-alignment, and confirmation bias start to matter a lot.
BeliefShift introduces a longitudinal benchmark designed specifically to evaluate belief dynamics in multi-session LLM interactions. It covers three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence-Driven Revision. The dataset includes 2,400 human-annotated multi-session interaction trajectories spanning health, politics, personal values, and product preferences.
We evaluate seven models including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA-3, and Mistral-Large under zero-shot and retrieval-augmented generation (RAG) settings. Results reveal a clear trade-off: models that personalize aggressively resist drift poorly, while factually grounded models miss legitimate belief updates.
We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).
Published: 2026-03-25T02:09:35Z
Authors: Praveen Kumar Myakala, Manan Agrawal, Rahul Manche

http://arxiv.org/abs/2601.06634v2
Kara-Kichwa Data Sovereignty Framework: Reference Point for Indigenous Data Authority Renaissances in LAC
Updated: 2026-03-25T01:39:21Z
For Indigenous Peoples of the Apya Yala (or Abya Yala), particularly the Kara and Kichwa citizens of the Pan-Andean-Amazonian biocultural region, data is not merely a knowledge or information resource; it is the extension of Khipu Panaka (Indigenous data authority), treading the data lifecycle, genealogical and relational memory held within customary law and collective responsibility. This perspective paper presents the Kara-Kichwa Data Sovereignty Framework, a living legal-ethical instrument developed through autopoietic Indigenous storytelling, rights to story and place, and Indigenous-informed scope review to engage with external Indigenous data frameworks, counteracting intellectual gentrification and the systemic invisibility of Andean-Amazonian Indigenous Peoples within global digital transformation. The framework codifies five customary pillars, Kamachy (self-determination, community owns data about itself), Aylu-laktapak kamachy (collective authority and polygovernance), Tantanakuy (collective deliberation and relational accountability), Wilay-panka-tantay (physical custody of data and knowledge confidentiality), and Sumak kawsay (biocultural ethics and intergenerational responsibility), to guide the data lifecycle from generation to expiration.
While this framework arises from Kara-Kichwa customary law, the pillars outline how its governance logics serve as a reference point for Indigenous data authority renaissances in Latin America and the Caribbean (LAC), through respectful adaptation by other Indigenous nations on their own terms.
Published: 2026-01-10T17:36:53Z
Note: 8 pages, 3 figures, submitted manuscript under review process
Authors: WariNkwi K. Flores, KunTikzi Flores, Rosa M. Panama, KayaKanti Alta

http://arxiv.org/abs/2603.23802v1
How are AI agents used? Evidence from 177,000 MCP tools
Updated: 2026-03-25T00:25:49Z
Today's AI agents are built on large language models (LLMs) equipped with tools to access and modify external environments, such as corporate file systems, API-accessible platforms and websites. AI agents offer the promise of automating computer-based tasks across the economy. However, developers, researchers and governments lack an understanding of how AI agents are currently being used, and for what kinds of (consequential) tasks. To address this gap, we evaluated 177,436 agent tools created from 11/2024 to 02/2026 by monitoring public Model Context Protocol (MCP) server repositories, the current predominant standard for agent tools. We categorise tools according to their direct impact: perception tools to access and read data, reasoning tools to analyse data or concepts, and action tools to directly modify external environments, like file editing, sending emails or steering drones in the physical world. We use O*NET mapping to identify each tool's task domain and consequentiality. Software development accounts for 67% of all agent tools, and 90% of MCP server downloads. Notably, the share of 'action' tools rose from 27% to 65% of total usage over the 16-month period sampled. While most action tools support medium-stakes tasks like editing files, there are action tools for higher-stakes tasks like financial transactions.
Using agentic financial transactions as an example, we demonstrate how governments and regulators can use this monitoring method to extend oversight beyond model outputs to the tool layer to monitor risks of agent deployment.
Published: 2026-03-25T00:25:49Z
Authors: Merlin Stein

http://arxiv.org/abs/2603.03339v3
Offline-First Large Language Model Architecture for AI-Assisted Learning with Adaptive Response Levels in Low-Connectivity Environments
Updated: 2026-03-24T20:54:10Z
Artificial intelligence (AI) and large language models (LLMs) are transforming educational technology by enabling conversational tutoring, personalized explanations, and inquiry-driven learning. However, most AI-based learning systems rely on continuous internet connectivity and cloud-based computation, limiting their use in bandwidth-constrained environments. This paper presents an offline-first large language model architecture designed for AI-assisted learning in low-connectivity settings. The system performs all inference locally using quantized language models and incorporates hardware-aware model selection to enable deployment on low-specification CPU-only devices. By removing dependence on cloud infrastructure, the system provides curriculum-aligned explanations and structured academic support through natural-language interaction. To support learners at different educational stages, the system includes adaptive response levels that generate explanations at varying levels of complexity: Simple English, Lower Secondary, Upper Secondary, and Technical. This allows explanations to be adjusted to student ability, improving clarity and understanding of academic concepts. The system was deployed in selected secondary and tertiary institutions under limited-connectivity conditions and evaluated across technical performance, usability, perceived response quality, and educational impact.
Results show stable operation on legacy hardware, acceptable response times, and positive user perceptions regarding support for self-directed learning. These findings demonstrate the feasibility of offline large language model deployment for AI-assisted education in low-connectivity environments.
Published: 2026-02-14T09:53:40Z
Note: There are mistakes, inaccurate information recorded about user responses, and the response times
Authors: Joseph Walusimbi, Ann Move Oguti, Joshua Benjamin Ssentongo, Keith Ainebyona

http://arxiv.org/abs/2512.03088v2
How DeFi Protocols Choose Oracle Providers: Evidence on Sourcing, Dependence, and Switching Costs
Updated: 2026-03-24T18:59:05Z
As data is an essential asset for any DeFi application, selecting an oracle is a critical decision for its success. To date, academic research has mainly focused on improving oracle technology and internal economics, while the drivers of oracle choice on the client side remain largely unexplored. This study addresses this gap by gathering insights from leading DeFi protocols, uncovering their rationale for oracle selection and their preferences regarding whether to outsource or internalize data-request mechanisms. Data are collected from founders, C-level executives, and oracle engineers of 32 DeFi protocols, whose combined total value locked (TVL) exceeds 55% of the oracle-using DeFi segment. The study leverages a one-time mixed-method survey, using tailored question paths for in-house versus third-party oracle users. Quantitative answers are summarized, compared across groups, and examined through Spearman rank-order correlations to explore pairwise associations among evaluation dimensions, while open-ended responses are inductively coded into keywords and broader themes to triangulate common selection motives and switching challenges. Insights support the view that protocol choices are tied to technological dependencies, in which the immutability of smart contracts amplifies lock-in, hindering agile switching among data providers.
Furthermore, when viable third-party solutions exist, protocols generally prefer to outsource rather than build and maintain internal oracle mechanisms.
Published: 2025-11-29T10:28:57Z
Note: Not peer reviewed
Authors: Giulio Caldarelli

http://arxiv.org/abs/2603.23485v1
Failure of contextual invariance in gender inference with large language models
Updated: 2026-03-24T17:52:22Z
Standard evaluation practices assume that large language model (LLM) outputs are stable under contextually equivalent formulations of a task. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behaviour. A Contextuality-by-Default analysis reveals that, in 19-52% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.
Published: 2026-03-24T17:52:22Z
Authors: Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli