http://arxiv.org/abs/2604.26220v1When Agents Shop for You: Role Coherence in AI-Mediated Markets2026-04-29T01:54:07ZConsumers are increasingly delegating purchase decisions to AI agents, providing natural-language descriptions of their preferences and identity. We argue that these representations constitute an information channel, role coherence, through which sellers can infer willingness to pay without explicit disclosure by the buyer agent, leading to preference leakage. In an experiment where a language-model buyer agent shops on behalf of a verbal consumer profile, we show that seller-side inference from dialogue alone recovers willingness to pay nearly one-for-one. Comparing this setting to a numeric-budget condition with confidentiality instructions cleanly isolates role coherence as distinct from instruction-following failure. Because this leakage arises from delegation itself, it cannot be mitigated at the prompt level. Instead, we propose architectural interventions that trade off personalization against preference privacy.2026-04-29T01:54:07ZSoogand AlaviSalar Nozarihttp://arxiv.org/abs/2604.26997v1Agent Name Service (ANS): A Proof-of-Concept Trust Layer for Secure AI Agent Discovery, Identity, and Governance in Kubernetes2026-04-29T01:24:27ZAutonomous AI agent ecosystems require stronger mechanisms for secure discovery, identity verification, capability attestation, and policy governance. Current deployments frequently lack (1) uniform agent discovery, (2) cryptographic agent authentication, (3) capability proofs that protect secrets, and (4) enforceable policy controls. This paper presents an implementation-oriented proof of concept for the Agent Name Service (ANS), a DNS-inspired trust layer for AI agent discovery and interoperability in Kubernetes, grounded in the ANS protocol specification~\cite{huang2025ans}. 
The implementation uses Decentralized Identifiers (DIDs), Verifiable Credentials (VCs), policy-as-code enforcement with Open Policy Agent (OPA), and Kubernetes-native integration patterns (CRDs, admission controls, service mesh integration). In a demo research environment (3-node cluster, 50-agent workflow simulation), we observe sub-10ms response in demonstrated service paths and full success for scripted demo deployment scenarios. We explicitly scope these findings as proof-of-concept evidence rather than production certification. We further provide a threat model, assumptions, and limitations to separate implemented evidence from protocol-defined and roadmap capabilities. The result is an evidence-grounded pathway from ANS protocol concepts to reproducible engineering practice for secure multi-agent systems.2026-04-29T01:24:27Z9 pages, 2 figuresAkshay MittalElyson De La Cruzhttp://arxiv.org/abs/2604.25067v2Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver2026-04-29T00:38:29ZForecasting when AI systems will become capable of meaningfully accelerating AI research is a central challenge for AI safety. Existing benchmarks measure broad capability growth, but may not provide ample early warning signals for recursive self-improvement. We propose measuring AI's capability to autonomously implement end-to-end machine learning pipelines from past AI research breakthroughs, given a minimal task description. By providing a concise task description instead of the full prior work as reference, we hope to better elicit emerging AI research taste. We introduce a proof-of-concept benchmark in which frontier coding agents autonomously implement an AlphaZero-style machine learning pipeline for Connect Four on consumer hardware within a three-hour budget, and we evaluate the resulting game AIs in a round-robin tournament anchored to the Pascal Pons Connect Four solver. 
Across four agents with eight trials each, we find substantial differentiation: Claude Opus 4.7 won as first-mover against Pons in seven of eight trials, statistically significantly better than the other agents tested, none of which exceeded two of eight. The task, which no frontier agent could reliably complete when we began development in January of 2026, is now near-saturation. Our evaluation also surfaced anomalous behavior in GPT-5.4, which consistently used far less of its allocated time budget than other agents. A follow-up 16-trial probe using shorter, less evaluation-coded prompts substantially increased GPT-5.4's time-budget usage, consistent with but not diagnostic of sandbagging; Bradley-Terry ratings across probe conditions showed only directional differences, despite significant differences in time-budget usage. We release our data, code, and prompts to support reproduction and extension.2026-04-27T23:48:30ZJoshua SherwoodBen AybarBenjamin Kaplanhttp://arxiv.org/abs/2510.05174v4Emergent Coordination in Multi-Agent Language Models2026-04-28T21:26:55ZWhen are multi-agent LLM systems merely a collection of individual agents versus an integrated collective with higher-order structure? We introduce an information-theoretic framework to test -- in a purely data-driven way -- whether multi-agent systems show signs of higher-order structure. This information decomposition lets us measure whether dynamical emergence is present in multi-agent LLM systems, localize it, and distinguish spurious temporal coupling from performance-relevant cross-agent synergy. We implement a practical criterion and an emergence capacity criterion operationalized as partial information decomposition of time-delayed mutual information (TDMI). We apply our framework to experiments using a simple guessing game without direct agent communication and minimal group-level feedback with three randomized interventions. 
Groups in the control condition exhibit strong temporal synergy but little coordinated alignment across agents. Assigning a persona to each agent introduces stable identity-linked differentiation. Combining personas with an instruction to ``think about what other agents might do'' shows identity-linked differentiation and goal-directed complementarity across agents. Taken together, our framework establishes that multi-agent LLM systems can be steered with prompt design from mere aggregates to higher-order collectives. Our results are robust across emergence measures and entropy estimators, and not explained by coordination-free baselines or temporal dynamics alone. Without attributing human-like cognition to the agents, the patterns of interaction we observe mirror well-established principles of collective intelligence in human groups: effective performance requires both alignment on shared objectives and complementary contributions across members.2025-10-05T11:26:41ZInternational Conference on Learning Representations (ICLR 2026)Christoph Riedlhttp://arxiv.org/abs/2604.26091v1Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital2026-04-28T20:10:33ZWe study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. 
Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.2026-04-28T20:10:33Z18 pages, 6 figures. Public onchain dashboard and supporting documentation linked in paperT. J. BartonChris ConstantakisPatti HausemanAnnie MousAlaska HoffmanBrian BergeronHunter Goodreauhttp://arxiv.org/abs/2606.00043v1Fake Plastic Voters: When Political Parties Can Use AI-Simulated Focus Groups2026-04-28T19:06:15ZPolitical parties strive to understand their electorates, and focus groups are a vital tool in these efforts. AI-enhanced simulation technologies (AESTs) enable synthetic focus groups in a fraction of the time (and cost), raising the question of when and how such simulated evidence can be used in campaign research. This paper develops a decision matrix to help party strategists match research needs to appropriate simulation technologies and to identify when to escalate to hybrid or fully human focus groups. 
The matrix combines three dimensions: strategic purpose, deployment risk, and empirical grounding of the simulation tool. Strategic purpose is the decisive dimension, as it determines what kind of evidence the focus group is meant to produce: observing how political meanings and identities emerge through interaction (Mode 1) or testing and refining campaign messages (Mode 2). The matrix shows that, given documented failure modes such as sycophancy, persona drift, and the suppression of minority viewpoints, AESTs cannot replace human interaction in Mode 1 at any risk level. Within Mode 2, suitability depends instead on deployment risk and on the empirical grounding. Yet even here, we caution that routine reliance on AESTs may erode the qualitative craft on which sound judgment depends.2026-04-28T19:06:15ZClaudio NovelliJavier Argota Sanchez-VaquerizoJennifer CyrGiuliano FormisanoSimon McDougallGiulia SandriLuciano Floridihttp://arxiv.org/abs/2604.26053v1I Would If I Could: Reasoning about Dynamics of Actions in Multi-Agent Systems2026-04-28T18:43:23ZAutonomous agents acting in realistic Multi-Agent Systems (MAS) should be able to adapt during their execution. Standard strategic logics, such as Alternating-time Temporal Logic (ATL), model agents' state- or history-dependent behaviour. However, the dynamic treatment of agents' available actions and their knowledge of required actions is still rarely addressed. In this paper, we introduce ATL with Dynamic Actions (ATL-D), which models the process of granting and revoking actions, and its extension ATEL-D, which captures how such updates affect agents' knowledge. 
Beyond the conceptual contribution, we provide several technical results: we analyse the expressivity of our logic in relation to ATL, study its relation to normative systems, and provide complexity results for relevant computational problems.2026-04-28T18:43:23ZThis is an extended version of the paper with the same title that will appear in KR 2026, and which contains a technical appendix with proof detailsRustam GalimullinHermine GrosingerMunyque Mittelmannhttp://arxiv.org/abs/2511.03286v4Characterising Global Platforms: Centralised, Decentralised, Federated, and Grassroots2026-04-28T13:06:02ZGlobal digital platforms are software systems designed to serve entire populations, with some already serving billions of people. We propose atomic transactions-based multiagent transition systems and protocols as a formal framework to study them; introduce essential agents -- minimal sets of agents the removal of which makes communication impossible; and show that the cardinality of essential agents partitions all global platforms into four classes:
1. Centralised -- one (the server)
2. Decentralised -- finite $>1$ (bootstrap nodes)
3. Federated -- infinite but not universal (all servers)
4. Grassroots -- universal (all agents but one)
Our illustrative formal example is a global social network, for which we provide centralised, decentralised, federated, and grassroots specifications via multiagent atomic transactions, and prove they all satisfy the same basic correctness properties, yet have different sets of essential agents as expected. We discuss informally additional global platforms -- currencies, ``sharing economy'' apps, AI, and more.
While this may be the first formal characterisation of centralised, decentralised, and federated global platforms, grassroots platforms have been defined previously, using two incomparable notions. Here, we prove that both definitions imply that all agents are essential, placing grassroots platforms within the broader formal context of all global platforms.
This work provides the first mathematical framework for classifying any global platform -- existing or imagined -- by providing a multiagent atomic-transactions specification of it and determining the cardinality of the minimal set of essential agents in the ensuing multiagent protocol. It thus provides a unifying mathematical approach for the study of global digital platforms, perhaps the most important class of computer systems today.2025-11-05T08:34:12ZEhud Shapirohttp://arxiv.org/abs/2604.25596v1Volitional Multiagent Atomic Transactions: Describing People and their Machines2026-04-28T13:02:49ZFormal models for concurrent and distributed systems describe machines; the people who operate them are either ignored or treated as external environment. Yet key distributed systems -- notably grassroots platforms -- include people operating their personal machines (smartphones), and their faithful description must include the states of both people and machines and how they jointly effect system behaviour.
Here, we propose volitional multiagent atomic transactions -- executed atomically by machines and guarded by their people's volitions -- as a novel mathematical foundation for specifying systems consisting of people operating machines. Each agent's state consists of a volitional state and machine state; a transaction is enabled when the machine precondition holds and the guarding persons are willing. For example, befriending two people is guarded by both; unfriending, by either; voluntary swap of coins and bonds is guarded by both parties, while a payment is guarded by the payer.
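The enabling rule in the paragraph above can be illustrated with a minimal sketch. It is not the authors' formalism: the class names, field names, transaction labels, and the specific machine preconditions (not yet friends, currently friends) are all hypothetical choices made for illustration. The only parts taken from the abstract are that a transaction is enabled when the machine precondition holds and the guarding persons are willing, that befriending is guarded by both people, and that unfriending is guarded by either.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple


@dataclass
class Agent:
    """One person plus their machine (all field names are hypothetical)."""
    name: str
    # Volitional state: labels of the transactions this person currently wills.
    volitions: Set[str] = field(default_factory=set)
    # Machine state: here just a friend list, as a stand-in example.
    friends: Set[str] = field(default_factory=set)


@dataclass(frozen=True)
class Transaction:
    label: str
    guards: Tuple[str, ...]  # people whose volition guards the transaction
    need_all: bool           # True: every guard must be willing; False: one suffices
    precondition: Callable[[Dict[str, Agent]], bool]  # machine-side precondition


def is_enabled(tx: Transaction, agents: Dict[str, Agent]) -> bool:
    """Enabled iff the machine precondition holds and the guards are willing."""
    votes = [tx.label in agents[g].volitions for g in tx.guards]
    guards_willing = all(votes) if tx.need_all else any(votes)
    return tx.precondition(agents) and guards_willing


def befriend(a: str, b: str) -> Transaction:
    # Guarded by both people; assumed precondition: they are not yet friends.
    return Transaction(
        f"befriend:{a}:{b}", (a, b), True,
        lambda ag: b not in ag[a].friends and a not in ag[b].friends,
    )


def unfriend(a: str, b: str) -> Transaction:
    # Guarded by either person; assumed precondition: they are currently friends.
    return Transaction(
        f"unfriend:{a}:{b}", (a, b), False,
        lambda ag: b in ag[a].friends and a in ag[b].friends,
    )
```

Under the same assumptions, a payment would be a transaction whose `guards` tuple contains only the payer, and a voluntary swap would list both parties with `need_all=True`.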
We develop the mathematical machinery to express safety and liveness of platforms specified in this framework, and provide example specifications of two grassroots platforms: social networks, and coins and bonds. These specifications are then used by AI to derive working implementations.
We employ here a novel and simpler definition of `grassroots' that better captures the informal notion -- multiple instances can form and operate independently, yet may coalesce -- and show that the platforms specified here, as well as those hitherto proven grassroots under the original definition, are grassroots under the new definition.2026-04-28T13:02:49ZAndy Lewis-PyeEhud Shapirohttp://arxiv.org/abs/2604.25567v1Should I Replan? Learning to Spot the Right Time in Robust MAPF Execution2026-04-28T12:39:12ZDuring the execution of Multi-Agent Path Finding (MAPF) plans in real-life applications, the MAPF assumption that the fleet's movement is perfectly synchronized does not apply. Since one or more of the agents may become delayed due to internal or external factors, it is often necessary to use a robust execution method to avoid collisions caused by desynchronization. Robust execution methods, such as the Action Dependency Graph (ADG), synchronize the execution of risky actions, but often at the expense of increased plan execution cost, because they may require some agents to wait for the delayed agents. In such cases, the execution's cost can be reduced while still preserving safety by finding a new plan, either by rescheduling (reordering the agents at crossroads) or by the more general replanning, which can find new paths. However, these operations may be costly, and the new plan may not even lead to lower execution cost than the original plan: for example, the two plans may be identical. Therefore, we use a fully connected feed-forward neural network to estimate the benefit achievable by a single replanning in scenarios with delayed agents, given the immediate state of the execution. The input to the neural network is a set of newly designed ADG-based features describing the robust execution's state and the impact of potential delays, and the output is an estimated benefit achievable by replanning. 
We train and test the network on a new labeled dataset containing 12,000 experiments, and we show that our proposed method is capable of reducing the impact of delays by up to 94.6% of the achievable reduction.2026-04-28T12:39:12Z8 pages, 10 figures. Submitted for double-blind review to IEEEDavid ZahrádkaCzech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in PragueFaculty of Electrical Engineering, Czech Technical University in PragueDavid WollerCzech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in PragueDenisa MužíkováCzech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in PragueFaculty of Electrical Engineering, Czech Technical University in PragueMiroslav KulichCzech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in PragueLibor PřeučilCzech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Praguehttp://arxiv.org/abs/2512.13956v3AOI: Context-Aware Multi-Agent Operations via Dynamic Scheduling and Hierarchical Memory Compression2026-04-28T12:10:07ZThe proliferation of cloud-native architectures, characterized by microservices and dynamic orchestration, has rendered modern IT infrastructures exceedingly complex and volatile. This complexity generates overwhelming volumes of operational data, leading to critical bottlenecks in conventional systems: inefficient information processing, poor task coordination, and loss of contextual continuity during fault diagnosis and remediation. To address these challenges, we propose AOI (AI-Oriented Operations), a novel multi-agent collaborative framework that integrates three specialized agents with an LLM-based Context Compressor. 
Its core innovations include: (1) a dynamic task scheduling strategy that adaptively prioritizes operations based on real-time system states, and (2) a three-layer memory architecture comprising Working, Episodic, and Semantic layers that optimizes context retention and retrieval. Extensive experiments on synthetic and real-world benchmarks show that AOI achieves 72.4% context compression while preserving 92.8% critical information, improves task success to 94.2%, and reduces MTTR by 34.4% over the best baseline. This work presents a paradigm shift towards scalable, adaptive, and context-aware autonomous operations, enabling robust management of next-generation IT infrastructures with minimal human intervention.2025-12-15T23:22:02Ztheory part rewrite.Zishan BaiHanxuan ChenJing LuoZiyi NiEnze GeJiacheng ShiYichao ZhangJiayi GuZhimo HanRiyang BaoJunfeng Haohttp://arxiv.org/abs/2510.02890v2Axiomatisation for an asynchronous epistemic logic with sending and receiving messages2026-04-28T09:32:11ZWe investigate a logic for asynchronous announcements wherein the sending of the messages by the environment is separated from their reception by the individual agents. Both come with different modalities. In the logical semantics, formulas are interpreted in a world of a Kripke model but given a history of prior announcements and receptions that already happened. An axiomatisation AA for such a logic has been given in prior work, for the formulas that are valid when interpreted in the Kripke model before any such announcements have taken place. This axiomatisation is a reduction system wherein one can show that every formula is equivalent to a purely epistemic formula without dynamic modalities for announcements and receptions. We propose a generalisation AA* of this axiomatisation, for the formulas that are valid when interpreted in the Kripke model given any history of prior announcements and receptions of announcements. 
It does not extend the axiomatisation AA; for example, it is no longer valid that nobody has received any message. Unlike AA, this axiomatisation AA* is infinitary and it is not a reduction system.2025-10-03T10:57:42ZPhilippe BalbianiHans van DitmarschClara Lerouvilloishttp://arxiv.org/abs/2604.25972v1A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication2026-04-28T07:50:24ZIn multi-agent reinforcement learning (MARL), integrating a communication mechanism allows agents to better learn to coordinate their actions and converge on their objectives by sharing information. Based on an interaction graph, a subclass of methods employs graph neural networks (GNNs) to learn the communication, enabling agents to improve their internal representations by enriching them with the information exchanged. With growing research, we note a lack of explicit structure and framework to distinguish and classify MARL approaches with communication based on GNNs. Thus, this paper surveys recent works in this field. We propose a generalized GNN-based communication process with the goal of making the underlying concepts behind the methods more obvious and accessible.2026-04-28T07:50:24ZRencontres des Jeunes Chercheurs en Intelligence Artificielle (RJCIA), Plate-Forme Intelligence Artificielle (PFIA), Jun 2026, Arras, FranceValentin Cuzin-RambaudLIRIS, UCBLLaetitia MatignonLIRIS, UCBLMaxime MorgeLIRIS, UCBLhttp://arxiv.org/abs/2604.25161v1Where Did It Go Wrong? Capability-Oriented Failure Attribution for Vision-and-Language Navigation Agents2026-04-28T03:08:48ZEmbodied agents in safety-critical applications such as Vision-Language Navigation (VLN) rely on multiple interdependent capabilities (e.g., perception, memory, planning, decision), making failures difficult to localize and attribute. Existing testing methods are largely system-level and provide limited insight into which capability deficiencies cause task failures. 
We propose a capability-oriented testing approach that enables failure detection and attribution by combining (1) adaptive test case generation via seed selection and mutation, (2) capability oracles for identifying capability-specific errors, and (3) a feedback mechanism that attributes failures to capabilities and guides further test generation. Experiments show that our method discovers more failure cases and more accurately pinpoints capability-level deficiencies than state-of-the-art baselines, providing more interpretable and actionable guidance for improving embodied agents.2026-04-28T03:08:48ZACL 2026Jianming ChenYawen WangJunjie WangXiaofei XieShoubin LiQing WangFanjiang Xuhttp://arxiv.org/abs/2604.25070v1Asymmetric-Information Resource Allocation Games: An LP Approach to Purposeful Deception2026-04-27T23:53:36ZIn this work, we introduce the Deceptive Resource Allocation Game (DRAG), which studies purposeful deception within a Bayesian game framework. In DRAG, a Defender allocates resources across the true asset and several decoys to influence an Attacker's beliefs and actions, with the goal of diverting the Attacker away from the true asset. We seek to characterize purposeful deception, whereby the Defender deceives only when doing so improves its performance. To this end, we solve for the Perfect Bayesian Nash Equilibrium (PBNE) of the corresponding game. We show that, despite the coupled belief-policy interdependence, the problem admits an efficient, non-iterative linear programming formulation. Numerical results demonstrate that the resulting policies naturally balance effective allocation and belief manipulation, giving rise to purposeful and emergent deceptive behaviors.2026-04-27T23:53:36ZLongxu PanYue GuanDaigo ShishikaPanagiotis Tsiotras