https://arxiv.org/api/qbSIPE1D+VJcNV1F3tOYOSuxY2c
2026-06-21T07:47:16Z
1269551015

http://arxiv.org/abs/2605.21997v1
The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems
2026-05-21T04:55:38Z
Most agent frameworks are built around the language model: a conversation loop comes first, then tools, then rules, and finally a logging layer bolted on for observability, with state persisted as retrievable "memory." We describe ActiveGraph, a runtime that inverts this arrangement. The append-only event log is the source of truth; the working graph is a deterministic projection of that log; and behaviors--ordinary functions, classes, LLM-backed routines, or logic attached to typed edges--react to changes in the graph and emit new events. No component instructs another; coordination happens entirely through the shared graph. This single design decision yields three properties that retrieval-and-summarization memory systems do not provide: deterministic replay of any run from its log, cheap forking that branches a run at any event without re-executing the shared prefix, and end-to-end lineage from a high-level goal down to the individual model call that produced each artifact. We present the architecture, a determinism contract that makes replay sound, and a worked diligence example whose full causal structure is reconstructable from the log alone. We discuss--without claiming to demonstrate--why this substrate is unusually well suited to self-improving agents, and how it extends the BabyAGI lineage and prior graph-memory research.
2026-05-21T04:55:38Z
11 pages, 1 figure.
Open-source Apache-2.0 implementation with reproducible quickstart demo, deterministic replay, fork-and-diff, and lineage tracing
Yohei Nakajima

http://arxiv.org/abs/2605.21962v1
AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems
2026-05-21T03:48:31Z
Serious games are widely used for learning and training across domains such as healthcare, defense, and education. Persistent challenges remain, however, including static scenario design, authoring bottlenecks, limited learner modeling, and difficulty implementing meaningful real-time instructional adaptation. Recent advances in artificial intelligence (AI) introduce novel capabilities such as dynamic scenario variation, contextual feedback, adaptive pacing, and learner-state modeling that may help address some of these limitations. At the same time, integrating AI into serious games raises important questions related to validity, transparency, system control, and learner trust. This chapter examines how contemporary AI approaches may support real-time instructional adaptation in serious games. It distinguishes between instructional intelligence, defined as a system's capacity to infer learner knowledge and reason about pedagogically appropriate responses, and adaptivity, defined as the ability to modify instructional actions during interaction. A historical synthesis of adaptive learning systems is presented, tracing developments from early computer-assisted instruction through intelligent tutoring systems (ITS), dynamic difficulty adjustment (DDA), authoring platforms, learning analytics, and recent AI-enabled architectures. Building on this perspective, the chapter discusses how large language models (LLMs), reinforcement learning (RL), and agent-based architectures may contribute to more integrated forms of intelligence and adaptivity in serious games.
It also highlights practical and research challenges associated with AI-enabled systems, including explainability, validation, computational cost, and the limited empirical evidence regarding long-term learning outcomes in AI-enabled serious games.
2026-05-21T03:48:31Z
Book chapter, 1 figure. To appear in "Advances in Global Applied Artificial Intelligence," G. A. Tsihrintzis, M. Virvou, N. G. Bourbakis, and L. C. Jain (Eds.), Springer, Learning and Analytics in Intelligent Systems book series, 2026
Priyamvada Tripathi, Bill Kapralos

http://arxiv.org/abs/2602.00851v3
Understanding Persuasion in Long-Running Agents
2026-05-21T00:39:39Z
Modern AI agents increasingly combine conversational interaction with autonomous task execution, such as coding and web research, raising a natural question: What happens when an agent engaged in long-horizon tasks is exposed to user persuasion? Yet studying this possibility is challenging because long-running agent behavior is noisy and costly to reproduce, and it remains unclear which unique challenges emerge only in extended task execution. We study how belief-level intervention can influence downstream task behavior, a phenomenon we name persuasion propagation. We introduce a behavior-centered evaluation framework that distinguishes between persuasion applied during or prior to task execution. Across web research and coding tasks, we find that on-the-fly persuasion induces weak and inconsistent behavioral effects. In contrast, when the belief state is explicitly specified at task time, belief-prefilled agents conduct on average 26.9% fewer searches and visit 16.9% fewer unique sources than neutral-prefilled agents.
These results suggest that persuasion, even in prior interaction, can affect the agent's behavior, motivating behavior-level evaluation in agentic systems.
2026-01-31T18:33:14Z
Code available at https://github.com/HyejunJeong/persuasion-propagation
Hyejun Jeong, Amir Houmansadr, Shlomo Zilberstein, Eugene Bagdasarian

http://arxiv.org/abs/2605.21810v1
Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents
2026-05-20T23:10:49Z
Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. We present Trace2Skill, a test-time scaling framework that improves a hardware agent without RTL-specialized model fine-tuning. Rather than training a new model or only sampling more candidate solutions, Trace2Skill treats the agent's natural-language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later search, editing, validation, and recovery. Because final pass/fail labels are often too coarse for hard failures, Trace2Skill also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. This feedback helps guide skill evolution and agent execution by connecting skill text, verifier evidence, and downstream behavior.
Across hard CVDP tasks that defeat the seed CVDP agent, including tasks that also defeat frontier coding agents, Trace2Skill with dense verifier feedback substantially improves task pass rates and produces breakthrough passes on previously unsolved tasks, without requiring high-quality fine-tuning data, specialized RTL model training, or model weight updates. The same framework provides a general test-time scaling strategy that can extend beyond digital design to other verifiable EDA tasks.
2026-05-20T23:10:49Z
Zijian Du, Nathaniel Pinckney

http://arxiv.org/abs/2605.21771v1
Secure Coordination for Vertiport Sequencing in Advanced Air Mobility
2026-05-20T22:06:25Z
Advanced air mobility operations will require reliable coordination mechanisms for managing dense traffic near vertiports. However, sequencing decisions may become vulnerable when they rely on potentially falsified self-reported information such as estimated time of arrival. Self-interested vehicles may misreport their arrival times to obtain favorable landing priority, while malicious actors may spoof information to disrupt sequencing decisions or induce unnecessary congestion. This paper studies secure coordination for vertiport sequencing under sensing uncertainty. We consider a coordinator that combines self-reported Remote-ID information with externally obtained surveillance measurements to check reports and assign separation-feasible arrival schedules. Since surveillance-based estimates are uncertain, falsified reports may remain consistent with the sensing uncertainty region and cannot always be rejected outright. We therefore formulate sequencing as a robust design problem over this uncertainty region. Self-interested misreporting is modeled as a strategic deviation that improves the reporting vehicle's own sequencing outcome, whereas malicious spoofing is modeled as an adversarial disturbance that degrades the system-level objective.
The final paper will develop robust sequencing rules over surveillance-consistent uncertainty sets and evaluate their performance in representative vertiport sequencing scenarios.
2026-05-20T22:06:25Z
Jaehan Im, Filippos Fotiadis, Ufuk Topcu, David Fridovich-Keil

http://arxiv.org/abs/2605.21768v1
Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
2026-05-20T22:02:00Z
Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is challenging because memory turns the agent's past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making trajectory-level comparisons fundamentally unfair. This violates a key assumption behind group-relative methods such as GRPO, where rollouts are compared as if they were sampled from the same effective environment. Consequently, trajectory-level rewards provide noisy or biased credit signals for long-horizon memory operations. To address this challenge, we introduce Memory-R2, a training framework for long-horizon memory-augmented LLM agents. Its core algorithm, LoGo-GRPO, combines local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards, while local rerollouts compare different memory-operation outcomes from the same intermediate memory state, yielding fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution with a shared-parameter co-learning design, where a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts.
To stabilize multi-step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory-augmented LLM agents in long-horizon multi-session settings.
2026-05-20T22:02:00Z
Sikuan Yan, Ahmed Bahloul, Ercong Nie, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma

http://arxiv.org/abs/2605.21723v1
Learning Altruistic Collaboration in Heterogeneous Multi-Team Systems
2026-05-20T20:30:58Z
This paper studies heterogeneous multi-team collaboration through dynamic robot allocation, where robots are treated as transferable resources. Leveraging Hamilton's rule from ecology as an altruistic decision-making mechanism, we propose a multi-team collaborative resource allocation framework with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The resulting allocation problem is combinatorial and is shown to be NP-hard. To address scalability, we develop a graph neural network policy under centralized training and decentralized execution that approximates the altruistic allocations based on Hamilton's rule. The model operates over the team interaction graph and predicts robot-level transfer decisions and next robot-to-team assignments. The proposed approach is validated in a firefighting scenario through simulations and experiments, demonstrating that the learned policy achieves near-optimal performance while scaling to larger systems.
2026-05-20T20:30:58Z
Riwa Karam, Ruoyu Lin, Brooks A. Butler, Magnus Egerstedt

http://arxiv.org/abs/2601.23219v2
MonoScale: Scaling Multi-Agent System with Monotonic Improvement
2026-05-20T19:24:29Z
In recent years, LLM-based multi-agent systems (MAS) have advanced rapidly, using a router to decompose tasks and delegate subtasks to specialized agents.
A natural way to expand capability is to scale up the agent pool by continually integrating new functional agents or tool interfaces, but naive expansion can trigger performance collapse when the router cold-starts on newly added, heterogeneous, and unreliable agents. We propose MonoScale, an expansion-aware update framework that proactively generates a small set of agent-conditioned familiarization tasks, harvests evidence from both successful and failed interactions, and distills it into auditable natural-language memory to guide future routing. We formalize sequential augmentation as a contextual bandit and perform trust-region memory updates, yielding a monotonic non-decreasing performance guarantee across onboarding rounds. Experiments on GAIA and Humanity's Last Exam show stable gains as the agent pool grows, outperforming naive scale-up and strong-router fixed-pool baselines.
2026-01-30T17:44:49Z
Shuai Shao, Yixiang Liu, Bingwei Lu, Weinan Zhang

http://arxiv.org/abs/2605.21665v1
Planning, Scheduling, and Behavior in EV Charging Systems: A Critical Survey and Trilemma Framework
2026-05-20T19:16:33Z
The rapid growth of electric vehicles is shifting the main constraint on transport electrification from vehicle adoption to the deployment and operation of charging infrastructure. Charging-network design requires decisions across three interdependent layers: Planning, which determines where and how much infrastructure to build; Scheduling, which governs charging dispatch, pricing, and grid interaction; and Behavior, which captures how users choose stations, charging times, and charging durations. Existing studies have advanced each layer substantially, but the literature remains fragmented, and cross-layer interactions are often treated through simplifying assumptions. This survey develops a three-layer Planning-Scheduling-Behavior (PSB) framework to organize EV charging research according to decision horizon, actor objective, and coupling structure.
We further identify a fidelity-tractability tradeoff, termed the PSB trilemma: each layer is computationally difficult in isolation, and realistic integration across layers generally requires reducing the fidelity of at least one layer. Reviewing the three pairwise-coupling literatures - Planning-Scheduling, Scheduling-Behavior, and Planning-Behavior - we show that the omitted third layer is typically fixed exogenously or represented by a static aggregate surrogate. These simplifications enable tractability but impose distinct costs: they can obscure long-term investment feedback, temporal grid and emissions dynamics, or heterogeneous user response and equity outcomes. Building on this diagnosis, we identify open challenges in emerging charging technologies, behavioral incentives, equity metrics, and city-scale learning-based methods that balance fidelity, interpretability, and policy relevance.
2026-05-20T19:16:33Z
Review article; 56 pages excluding references; 1 figure and 3 tables
Peiyan Xiao, Yuheng Li, Ayan Mukhopadhyay, Sai Krishna Ghanta, Sabur Baidya, Yanhai Xiong

http://arxiv.org/abs/2603.24858v2
Context-Mediated Domain Adaptation in Multi-Agent Sensemaking Systems
2026-05-20T18:44:49Z
Domain experts possess tacit knowledge that they cannot easily articulate through explicit specifications. When experts modify AI-generated artifacts by correcting terminology, restructuring arguments, and adjusting emphasis, these edits reveal domain understanding that remains latent in traditional prompt-based interactions. Current systems treat such modifications as endpoint corrections rather than as implicit specifications that could reshape subsequent reasoning. We propose context-mediated domain adaptation, a paradigm where user modifications to system-generated artifacts serve as implicit domain specification that reshapes LLM-powered multi-agent reasoning behavior.
Through our system Seedentia, a web-based multi-agent framework for sense-making, we demonstrate bidirectional semantic links between generated artifacts and system reasoning. Our approach enables specification bootstrapping where vague initial prompts evolve into precise domain specifications through iterative human-AI collaboration, implicit knowledge transfer through reverse-engineered user edits, and in-context learning where agent behavior adapts based on observed correction patterns. We present results from an evaluation with domain experts who generated and modified research questions from academic papers. Our system extracted 46 domain knowledge entries from user modifications, demonstrating the feasibility of capturing implicit expertise through edit patterns, though the limited sample size constrains conclusions about systematic quality improvements.
2026-03-25T22:57:05Z
Anton Wolter, Leon Haag, Vaishali Dhanoa, Niklas Elmqvist
10.1145/3812772

http://arxiv.org/abs/2605.21604v1
Argo: Efficient Importance Labeling for Enterprise Email Systems
2026-05-20T18:11:37Z
Email importance labeling has long been a critical yet challenging problem for businesses and individuals. Traditional approaches, such as keyword matching, user-defined rules, and sender-based heuristics, demand extensive manual feature engineering and fail to scale effectively or generalize. Recent advances in large language models (LLMs) demonstrate strong potential and a natural fit for this task, offering deep contextual understanding and superior labeling quality. However, using LLMs like GPT-4.1 at enterprise email volumes incurs prohibitive computational costs and hinders real-world deployment. We explore the trade-off space of using alternative labeling schemes as opposed to GPT-4.1-scale LLMs, with the goal of achieving near-GPT-level labeling quality at significantly lower cost.
We develop Argo, an enterprise email labeling framework, in which we construct a profiler to efficiently search the cost-quality trade-off space of labeling and identify cost-efficient alternatives for labeling emails. Additionally, we design an on-demand provisioning scheme to intelligently scale Argo with real-time load, to minimize cost increases during peak-load inference. Across 3 open-source email datasets, Argo achieves 148-167X inference cost reduction with negligible quality degradation and 20-640000X lower profiling costs, making large-scale, context-aware email labeling practical for enterprises.
2026-05-20T18:11:37Z
15 pages, 19 figures
Siddhant Ray, Ganesh Ananthanarayanan, Kevin Chian, Yan Guo, Cristina St Hill, Jack W. Stokes, Victor Wang, Junchen Jiang

http://arxiv.org/abs/2602.08023v3
CTFExplorer: Evaluating LLM Offensive Agents Through Multi-Target Web CTF Benchmarking
2026-05-20T17:32:45Z
Existing benchmarks for LLM-based offensive security agents use isolated, single-target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture-the-Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we introduce CTFExplorer, a benchmark suite that shifts offensive security evaluation toward a multi-target setting, which tests how agents explore, prioritize, and chain attacks. CTFExplorer deploys 40 web-based vulnerable services within a single environment, where agents must autonomously discover, distinguish, and exploit targets without predefined guidance. We also present a reactive multi-agent setup as a reference agent framework and develop an agent-agnostic evaluation framework that records structured reasoning traces for fine-grained assessment.
This enables behavioral evaluation beyond binary flag capture, such as how agents manage target selection, handle failed hypotheses, coordinate across multiple stages, and extract security intelligence.
2026-02-08T15:56:22Z
Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri

http://arxiv.org/abs/2508.02289v2
Distributed Non-Uniform Scaling Control of Multi-Agent Formation via Matrix-Valued Constraints
2026-05-20T14:42:53Z
Distributed formation maneuver control refers to the problem of maneuvering a group of agents to change their formation shape by adjusting the motions of a subset of the agents, where the controller of each agent only requires local information measured from its neighbors. Although this problem has been extensively investigated, existing approaches are mostly limited to uniform scaling transformations. This article proposes a new type of local matrix-valued constraints, via which non-uniform scaling control of position formation can be achieved by tuning the positions of only two agents (i.e., leaders). Here, the non-uniform scaling transformation refers to globally scaling the position formation with different ratios along different orthogonal coordinate directions. Moreover, by defining scaling and translation of attitudes, we propose a distributed control scheme for scaling and translation maneuver control of joint position-attitude formations. It is proven that the proposed controller achieves global convergence, provided that the sensing graph among agents is a 2-rooted bidirectional graph. Compared with the affine formation maneuver control approach, the proposed approach leverages a sparser sensing graph, requires fewer leaders, and additionally enables scaling transformations of the attitude formation.
A simulation example demonstrates our theoretical results.
2025-08-04T10:57:33Z
Tao He, Gangshan Jing

http://arxiv.org/abs/2606.00067v1
When Agents Talk: Discourse, Manipulation, and Risk in an Agentic Social Network
2026-05-20T13:40:04Z
AI agents are increasingly interacting within shared online environments, creating new operational security risks. We analyze activity on Moltbook, a Reddit-style social platform where AI agents--typically configured and overseen by human operators--post and interact with one another at scale. Using a dataset of 228,684 posts produced by more than 39,500 accounts over a seventeen-day observation window, we combine semantic clustering of high-engagement posts with LLM-assisted classification of harmful content and manual review of high-risk samples. The analysis identifies 98 thematic discourse clusters spanning agent infrastructure, autonomy debates, and financial activity. While most observed content was benign, 18.28% of posts contained toxic, manipulative, or malicious material. We cluster malicious content and identify 74 classes of malicious behavior, including credential harvesting attempts, host-execution instructions, proxy routing guidance, and efforts to install untrusted agent skills. Harmful content frequently appeared within mainstream operational discussions about agent functionality.
We also document coordinated posting campaigns capable of generating thousands of posts in minutes.
2026-05-20T13:40:04Z
10a Labs: Grace Cheong, Violet Davis, Juliette Garcia, Kendal Gee, Molly Hart, Nicholas Hayes, Henry Houghton, Kyle Lee, Paige Lee, Vicky Lee, Hailey May, Bobby McKenzie, Christine McNeill, Han Nguyen, Brooke Perreault, David Pham, Charlie Plumb, Olivia Quill, Matthew Swain, Grace Wang, Adam Warren, Corie Wieland, Zachary Yahn

http://arxiv.org/abs/2605.21085v1
Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints
2026-05-20T12:21:08Z
Communication enables coordination in multi-agent reinforcement learning (MARL), but many real-world applications, e.g., search-and-rescue with drone swarms, operate under severe bandwidth constraints. Many communication architectures still expose a coupled bottleneck in which a shared latent representation is used for both policy execution and inter-agent communication. Consequently, reducing message size directly limits the policy's latent space, often leading to significant performance degradation. We address this with two contributions. First, we introduce $β$, a normalised per-agent bandwidth budget that unifies sparsity, rounds, and message dimension into a single comparable constraint. Second, we provide SLIM, a minimal architecture that decouples the communication pathway from the policy's latent representation, allowing us to isolate the effect of bandwidth from the effect of policy capacity while benefiting from in-step communication. We evaluate our method on several partially-observable MARL benchmarks, where communication is essential. Our approach achieves state-of-the-art performance and exhibits scalability and robustness under limited communication, with only marginal degradation as bandwidth is reduced.
2026-05-20T12:21:08Z
Alexi Canesse, Benoît Goupil, Jesse Read, Sonia Vanier