https://arxiv.org/api/r3oVvVZnWyMN9xKnepZ2cwobCUU
2026-06-27T19:35:37Z
12761123015

http://arxiv.org/abs/2602.04849v2
El Agente Estructural: An Artificially Intelligent Molecular Editor
2026-04-13T16:51:33Z
We present El Agente Estructural, a multimodal, natural-language-driven geometry-generation and manipulation agent for autonomous chemistry and molecular modelling. Unlike molecular generation or editing via generative models, Estructural mimics how human experts directly manipulate molecular systems in three dimensions by integrating a comprehensive set of domain-informed tools and vision-language models. This design enables precise control over atomic or functional group replacements, atomic connectivity, and stereochemistry without the need to rebuild extensive core molecular frameworks. Through a series of representative case studies, we demonstrate that Estructural enables chemically meaningful geometry manipulation across a wide range of real-world scenarios. These include site-selective functionalization, ligand binding, ligand exchange, stereochemically controlled structure construction, isomer interconversion, fragment-level structural analysis, image-guided generation of structures from schematic reaction mechanisms, and mechanism-driven geometry generation and modification. These examples illustrate how multimodal reasoning, when combined with specialized geometry-aware tools, supports interactive and context-aware molecular modelling beyond structure generation. Looking forward, the integration of Estructural into El Agente Quntur, an autonomous multi-agent quantum chemistry platform, enhances its capabilities by adding sophisticated tools for the generation and editing of three-dimensional structures.
2026-02-04T18:38:48Z
Changhyeok Choi, Yunheng Zou, Marcel Müller, Han Hao, Yeonghun Kang, Juan B.
Pérez-Sánchez, Ignacio Gustin, Hanyong Xu, Andrew Wang, Mohammad Ghazi Vakili, Chris Crebolder, Alán Aspuru-Guzik, Varinia Bernales

http://arxiv.org/abs/2508.08574v3
DeepFleet: Multi-Agent Foundation Models for Mobile Robots
2026-04-13T16:50:27Z
We introduce DeepFleet, a suite of foundation models designed to support coordination and planning for large-scale mobile robot fleets. These models are trained on fleet movement data, including robot positions, goals, and interactions, from hundreds of thousands of robots in Amazon warehouses worldwide. DeepFleet consists of four architectures that each embody a distinct inductive bias and collectively explore key points in the design space for multi-agent foundation models: the robot-centric (RC) model is an autoregressive decision transformer operating on neighborhoods of individual robots; the robot-floor (RF) model uses a transformer with cross-attention between robots and the warehouse floor; the image-floor (IF) model applies convolutional encoding to a multi-channel image representation of the full fleet; and the graph-floor (GF) model combines temporal attention with graph neural networks for spatial relationships. In this paper, we describe these models and present our evaluation of the impact of these design choices on prediction task performance. We find that the robot-centric and graph-floor models, which both use asynchronous robot state updates and incorporate the localized structure of robot interactions, show the most promise. We also present experiments that show that these two models can make effective use of larger warehouse operation datasets as the models are scaled up.
2025-08-12T02:19:15Z
27 pages, 10 figures, 2 tables
Ameya Agaskar, Sriram Siva, William Pickering, Kyle O'Brien, Charles Kekeh, Alexandre Ormiga Galvao Barbosa, Ang Li, Brianna Gallo Sarker, Alicia Chua, Mayur Nemade, Charun Thattai, Jiaming Di, Isaac Iyengar, Ramya Dharoor, Dino Kirouani, Jimmy Erskine, Tamir Hegazy, Scott Niekum, Usman A. Khan, Federico Pecora, Joseph W.
Durham

http://arxiv.org/abs/2604.11655v1
RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents
2026-04-13T16:08:03Z
The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraint-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, non-redundancy, and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework's ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures prone to user-alignment bias or sycophancy.
RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.
2026-04-13T16:08:03Z
Riccardo Rosati, Edoardo Colucci, Massimiliano Bolognini, Adriano Mancini, Paolo Sernani

http://arxiv.org/abs/2604.11523v1
PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints
2026-04-13T14:26:38Z
We are entering an era in which individuals and organizations increasingly deploy dedicated AI agents that interact and collaborate with other agents. However, the dynamics of multi-agent collaboration under privacy constraints remain poorly understood. In this work, we present $PAC\text{-}Bench$, a benchmark for systematic evaluation of multi-agent collaboration under privacy constraints. Experiments on $PAC\text{-}Bench$ show that privacy constraints substantially degrade collaboration performance and make outcomes depend more on the initiating agent than the partner. Further analysis reveals that this degradation is driven by recurring coordination breakdowns, including early-stage privacy violations, overly conservative abstraction, and privacy-induced hallucinations. Together, our findings identify privacy-aware multi-agent collaboration as a distinct and unresolved challenge that requires new coordination mechanisms beyond existing agent capabilities.
2026-04-13T14:26:38Z
Minjun Park, Donghyun Kim, Hyeonjong Ju, Seungwon Lim, Dongwook Choi, Taeyoon Kwon, Minju Kim, Jinyoung Yeo

http://arxiv.org/abs/2604.11466v1
SLALOM: Simulation Lifecycle Analysis via Longitudinal Observation Metrics for Social Simulation
2026-04-13T13:40:50Z
Large Language Model (LLM) agents offer a potentially transformative path forward for generative social science but face a critical crisis of validity. Current simulation evaluation methodologies suffer from the "stopped clock" problem: they confirm that a simulation reached the correct final outcome while ignoring whether the trajectory leading to it was sociologically plausible.
Because the internal reasoning of LLMs is opaque, verifying the "black box" of social mechanisms remains a persistent challenge. In this paper, we introduce SLALOM (Simulation Lifecycle Analysis via Longitudinal Observation Metrics), a framework that shifts validation from outcome verification to process fidelity. Drawing on Pattern-Oriented Modeling (POM), SLALOM treats social phenomena as multivariate time series that must traverse specific SLALOM gates, or intermediate waypoint constraints representing distinct phases. By utilizing Dynamic Time Warping (DTW) to align simulated trajectories with empirical ground truth, SLALOM offers a quantitative metric to assess structural realism, helping to differentiate plausible social dynamics from stochastic noise and contributing to more robust policy simulation standards.
2026-04-13T13:40:50Z
CHI 2026 PoliSim@CHI 2026: LLM Agent Simulation for Policy Workshop
Juhoon Lee, Joseph Seering

http://arxiv.org/abs/2604.11861v1
BIND-USBL: Bounding IMU Navigation Drift using USBL in Heterogeneous ASV-AUV Teams
2026-04-13T12:50:48Z
Accurate and continuous localization of Autonomous Underwater Vehicles (AUVs) in GPS-denied environments is a persistent challenge in marine robotics. In the absence of external position fixes, AUVs rely on inertial dead-reckoning, which accumulates unbounded drift due to sensor bias and noise. This paper presents BIND-USBL, a cooperative localization framework in which a fleet of Autonomous Surface Vessels (ASVs) equipped with Ultra-Short Baseline (USBL) acoustic positioning systems provides intermittent fixes to bound AUV dead-reckoning error. The key insight is that long-duration navigation failure is driven not by the accuracy of individual USBL measurements, but by the temporal sparsity and geometric availability of those fixes.
BIND-USBL combines a multi-ASV formation model linking survey scale and anchor placement to acoustic coverage, a conflict-graph-based TDMA uplink scheduler for shared-channel servicing, and delayed fusion of received USBL updates with drift-prone dead reckoning. The framework is evaluated in the HoloOcean simulator using heterogeneous ASV-AUV teams executing lawnmower coverage missions. The results show that localization performance is shaped by the interaction of survey scale, acoustic coverage, team composition, and ASV-formation geometry. Further, the spatial-reuse scheduler improves per-AUV fix delivery rate without violating the no-collision constraint, while maintaining low end-to-end fix latency.
2026-04-13T12:50:48Z
Accepted at OCEANS 2026, Sanya, China
Pranav Kedia, Rajini Makam, Heiko Hamann, Suresh Sundaram

http://arxiv.org/abs/2509.11787v5
CodeCureAgent: Automatic Classification and Repair of Static Analysis Warnings
2026-04-13T11:44:21Z
Static analysis tools are widely used to detect bugs, vulnerabilities, and code smells. Traditionally, developers must resolve these warnings manually. Because this process is tedious, developers sometimes ignore warnings, leading to an accumulation of warnings and a degradation of code quality. This paper presents CodeCureAgent, an approach that harnesses LLM-based agents to automatically analyze, classify, and repair static analysis warnings. Unlike previous work, our method does not follow a predetermined algorithm. Instead, we adopt an agentic framework that iteratively invokes tools to gather additional information from the codebase (e.g., via code search) and edit the codebase to resolve the warning. CodeCureAgent detects and suppresses false positives, while fixing true positives when identified. We equip CodeCureAgent with a three-step heuristic to approve patches: (1) build the project, (2) verify that the warning disappears without introducing new warnings, and (3) run the test suite.
We evaluate CodeCureAgent on a dataset of 1,000 SonarQube warnings found in 106 Java projects and covering 291 distinct rules. Our approach produces plausible fixes for 96.8% of the warnings, outperforming state-of-the-art baseline approaches by 29.2%-34.0% in plausible-fix rate. Manual inspection of 291 cases reveals a correct-fix rate of 86.3%, showing that CodeCureAgent can reliably repair static analysis warnings. The approach incurs LLM costs of about 2.9 cents (USD) and an end-to-end processing time of about four minutes per warning. We envision CodeCureAgent helping to clean existing codebases and being integrated into CI/CD pipelines to prevent the accumulation of static analysis warnings.
2025-09-15T11:16:04Z
Pascal Joos, Islem Bouzenia, Michael Pradel

http://arxiv.org/abs/2604.11346v1
Incentive Design without Hypergradients: A Social-Gradient Method
2026-04-13T11:43:24Z
Incentive design problems consider a system planner who steers self-interested agents toward a socially optimal Nash equilibrium by issuing incentives in the presence of information asymmetry, that is, uncertainty about the agents' cost functions. A common approach formulates the problem as a Mathematical Program with Equilibrium Constraints (MPEC) and optimizes incentives using hypergradients, i.e., the total derivatives of the planner's objective with respect to incentives. However, computing or approximating the hypergradients typically requires full or partial knowledge of equilibrium sensitivities to incentives, which is generally unavailable under information asymmetry. In this paper, we propose a hypergradient-free incentive law, called the social-gradient flow, for incentive design when the planner's social cost depends on the agents' joint actions. We prove that the social cost gradient is always a descent direction for the planner's objective, irrespective of the agent cost landscape.
In the idealized setting where equilibrium responses are observable, the social-gradient flow converges to the unique socially optimal incentive. When equilibria are not directly observable, the social-gradient flow emerges as the slow-timescale limit of a two-timescale interaction, in which agents' strategies evolve on a faster timescale. It is established that the joint strategy-incentive dynamics converge to the social optimum for any agent learning rule that asymptotically tracks the equilibrium. Theoretical results are also validated via numerical experiments.
2026-04-13T11:43:24Z
8 pages, 4 figures
Georgios Vasileiou, Lantian Zhang, Silun Zhang

http://arxiv.org/abs/2604.11337v1
Governance by Design: A Parsonian Institutional Architecture for Internet-Wide Agent Societies
2026-04-13T11:37:46Z
The dominant paradigm of local multi-agent systems -- orchestrated, enterprise-bounded pipelines -- is being superseded by internet-wide agent societies in which autonomous agents discover each other through open registries, interact without central orchestrators, and generate emergent social behaviors. We argue that governing such societies requires institutional design, not merely risk enumeration or process compliance. Applying Talcott Parsons' AGIL framework -- four functional imperatives (Adaptation, Goal Attainment, Integration, Latency) every viable social system must satisfy -- we derive a prescriptive sixteen-cell institutional architecture for internet-wide agent governance. Diagnostically applied to the OpenClaw ecosystem (250,000+ GitHub stars, 2M+ monthly users, 770,000+ registered agents) via a recursive sub-function analysis (64 binary indicators across 16 cells), we find at most 19% sub-function coverage (sensitivity range 17-30%) -- potential rather than operative capacity, since zero inter-cell coordination prevents existing infrastructure from participating in inter-pillar interchange.
A complementary interchange media assessment finds zero of twelve inter-pillar pathways functional: the ecosystem has technical infrastructure but no active governance, no coordination layer, and no normative grounding, with the Fiduciary and Political pillars most severely underserved. Extending the diagnostic to the broader agent-native protocol stack (MCP, A2A, ANP, x402, ERC-8004), independent development teams reproduce the same structural pattern -- confirming the governance gap is a feature of market-driven development, not ecosystem immaturity. Institutional design is most effective before social patterns calcify; we conclude with a prioritized roadmap for the missing governance infrastructure.
2026-04-13T11:37:46Z
Anbang Ruan

http://arxiv.org/abs/2603.28705v3
Binary Decisions in DAOs: Accountability and Belief Aggregation via Linear Opinion Pools
2026-04-13T10:07:10Z
We study binary decision-making in governance councils of Decentralized Autonomous Organizations (DAOs), where experts choose between two alternatives on behalf of the organization. We introduce an information structure model for such councils and formalize desired properties in blockchain governance. We propose a mechanism assuming an evaluation tool that ex-post returns a boolean indicating success or failure, implementable via smart contracts. Experts hold two types of private information: idiosyncratic preferences over alternatives and subjective beliefs about which is more likely to benefit the organization. The designer's objective is to select the best alternative by aggregating expert beliefs, framed as a classification problem. The mechanism collects preferences and computes monetary transfers accordingly, then applies additional transfers contingent on the boolean outcome. For aligned experts, the mechanism is dominant strategy incentive compatible.
For unaligned experts, we prove a Safe Deviation property: no expert can profitably deviate toward an alternative they believe is less likely to succeed. Our main result decomposes the sum of reports into idiosyncratic noise and a linearly pooled belief signal whose sign matches the designer's optimal decision. The pooling weights arise endogenously from equilibrium strategies, and correct classification is achieved whenever the per-expert budget exceeds a threshold that decreases as experts' beliefs converge.
2026-03-30T17:23:47Z
23 pages, 2 figures, 1 table, 1 algorithm
Nuno Braz, Miguel Correia, Diogo Poças

http://arxiv.org/abs/2605.18768v1
ClinQueryAgent: A Conversational Agent for Population Health Management
2026-04-13T10:07:03Z
In this paper we introduce ClinQueryAgent, a system for translating natural language population health questions into executable database queries using agents with access to both local and external knowledge bases. Our novel architecture enables the use of powerful cloud-based language models whilst ensuring that no patient data leaves the secure environment. To combat inaccuracies over the course of longer dialogues due to context rot, information retrieval is delegated to a sub-agent. We deploy the system via a chat window embedded within an existing population health management platform where it has been used by 128 staff from 15 healthcare practices covering a total of 148,319 patients in the UK's National Health Service (NHS). We evaluate the system's capacity to autonomously handle a range of health informatics tasks on a constructed dataset and via a beta-testing phase. Our results show that both analysts and clinicians are able to easily generate actionable information from patient health records using natural language requests requiring no programming expertise to verify. We make a public demo of the system available at: https://demo-899965260288.europe-west1.run.app/
2026-04-13T10:07:03Z
11 pages, 4 figures.
Submitted to ACL Systems Demonstrations
Joseph S. Boyle, Anthony Dranfield, Mike O'Neil, Maria Liakata, Alison Q. Smithard

http://arxiv.org/abs/2601.11496v2
The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents
2026-04-13T09:40:38Z
The integration of AI agents into economic markets fundamentally alters the landscape of strategic interaction. We investigate the economic implications of expanding the set of available technologies in three canonical game-theoretic settings: bargaining (resource division), negotiation (asymmetric information trade), and persuasion (strategic information transmission). We find that simply increasing the choice of AI delegates can drastically shift equilibrium payoffs and regulatory outcomes, often creating incentives for regulators to proactively develop and release technologies. Conversely, we identify a strategic phenomenon termed the "Poisoned Apple" effect: an agent may release a new technology, which neither they nor their opponent ultimately uses, solely to manipulate the regulator's choice of market design in their favor. This strategic release improves the releaser's welfare at the expense of their opponent and the regulator's fairness objectives. Our findings demonstrate that static regulatory frameworks are vulnerable to manipulation via technology expansion, necessitating dynamic market designs that adapt to the evolving landscape of AI capabilities.
2026-01-16T18:18:03Z
Eilam Shapira, Roi Reichart, Moshe Tennenholtz

http://arxiv.org/abs/2604.03648v3
DejaVu: A Minimalistic Mechanism for Distributed Plurality Consensus
2026-04-13T09:13:47Z
We study the plurality consensus problem in distributed systems where a population of extremely simple agents, each initially holding one of $k$ opinions, aims to agree on the initially most frequent one.
In this setting, $h$-majority is arguably the simplest and most studied protocol, in which each agent samples the opinion of $h$ neighbors uniformly at random and updates its opinion to the most frequent value in the sample. We propose a new, extremely simple mechanism called DéjàVu: an agent queries neighbors until it encounters an opinion for the second time, at which point it updates its own opinion to the duplicate value. This rule does not require agents to maintain counters or estimate frequencies, nor to choose any parameter (such as a sample size $h$); it relies solely on the primitive ability to detect repetition. We provide a rigorous analysis of DéjàVu that relies on several technical ideas of independent interest and demonstrates that it is competitive with $h$-majority and, in some regimes, substantially more communication-efficient, thus yielding a powerful primitive for plurality consensus.
2026-04-04T08:53:28Z
Title layout fixed
Francesco d'Amore, Niccolò D'Archivio, George Giakkoupis, Frédéric Giroire, Emanuele Natale

http://arxiv.org/abs/2604.11204v1
Semantic Rate-Distortion Theory: Deductive Compression and Closure Fidelity
2026-04-13T09:02:20Z
Shannon's rate-distortion theory treats source symbols as unstructured labels. When the source is a knowledge base equipped with a logical proof system, a natural fidelity criterion is closure fidelity: a reconstruction is acceptable if it preserves the deductive closure of the original. This paper develops a rate-distortion theory under this criterion. Central to the theory is the irredundant core, a canonical generating set extracted by a fixed-order deletion procedure, from which the full deductive closure can be rederived. We prove that the zero-distortion semantic rate equals a quantity that is strictly below the classical entropy rate whenever the knowledge base contains redundant states.
More generally, the full semantic rate-distortion function depends only on the core; redundant states are invisible to both rate and distortion. We derive a semantic source-channel separation theorem showing a semantic leverage phenomenon: under closure fidelity, the required source rate is reduced by an asymptotic leverage factor greater than one, allowing the same knowledge base to be communicated with proportionally fewer channel uses, not by violating Shannon capacity but because redundant states become free. We also prove a strengthened Fano inequality that exploits core structure. For heterogeneous multi-agent communication, an overlap decomposition gives necessary and sufficient conditions for closure-reliable transmission and identifies a semantic bottleneck in broadcast settings that persists even over noiseless channels. All results are verified on Datalog instances with up to 24,000 base facts.
2026-04-13T09:02:20Z
Jianfeng Xu

http://arxiv.org/abs/2604.11161v1
A Simulation-Based Method for Testing Collaborative Learning Scaffolds Using LLM-Based Multi-Agent Systems
2026-04-13T08:25:17Z
Background: Traditional research on collaborative learning scaffolding is often time-consuming and resource-heavy, which hinders the rapid iteration and optimization of instructional strategies. LLM-based multi-agent systems have recently emerged as a powerful tool to simulate complex social interactions and provide a novel paradigm for educational research. Objectives: This study proposes an LLM-based multi-agent simulation approach to investigate collaborative learning processes and the effectiveness of instructional scaffolds prior to actual classroom deployment. The research specifically examines the feasibility of simulating group discussions and the alignment of these simulations with established learning science theories.
Methods: The simulation system was implemented using the MetaGPT framework and GPT-4o, comprising one teacher agent and five distinct student roles (Leader, Supporter, Expounder, Rebutter, and Summarizer). Two scaffolding strategies, "Deep Think before Speak" and "Direct Speak", were compared across ten classical Chinese poetry appreciation tasks. Evaluation was conducted through discourse analysis of quality and behavior. Results and Conclusions: The introduction of the "Deep Think before Speak" scaffold significantly improved the agents' discourse diversity and interaction depth while notably reducing content repetitiveness. Behavioral analysis showed that the scaffold encouraged more complex interaction patterns, such as reflecting, rebutting, and explaining. These findings align with the ICAP framework, as the scaffold prompted agents to move from simple "Active" participation to "Constructive" and "Interactive" knowledge co-construction. This study demonstrates the feasibility and ecological validity of using LLM-based multi-agent systems to simulate authentic collaborative learning dynamics.
2026-04-13T08:25:17Z
Submitted to Journal of Computer Assisted Learning
Han Wua, Lishan Zhang, Chunming Lu