http://arxiv.org/abs/2605.12920v2
Embodied Multi-Agent Coordination by Aligning World Models Through Dialogue
Updated: 2026-05-16T03:44:07Z
Effective collaboration between embodied agents requires more than acting in a shared environment; it demands communication grounded in each agent's evolving understanding of the world. When agents can only partially observe their surroundings, coordination without communication is provably hard, but communication can, in principle, bridge this gap by allowing agents to share observations and align their world models. In this work, we examine whether LLM-based embodied agents actually realize the ability to communicate. We extend PARTNR, a benchmark for collaborative household robotics, with a natural-language dialogue channel that enables two agents with partial observability to communicate during task execution. To evaluate whether dialogue leads to genuine world-model alignment rather than superficial coordination, we propose a framework for measuring world-model alignment defined over per-agent world graphs: observation convergence (do private world models align over time?), information novelty (do messages convey what the partner lacks?), and belief-sensitive messaging (do agents model what their partner knows?). Our experiments across three LLMs reveal that dialogue reduces action conflicts by 40 to 83 percentage points but degrades task success relative to silent coordination.
Using our metrics, we characterize the gap between superficial coordination and genuine world-model alignment, and identify where current models fall on this spectrum.
Published: 2026-05-13T02:48:14Z
Authors: Vardhan Dongre, Dilek Hakkani-Tür

http://arxiv.org/abs/2605.16784v1
Dynamic Deployment of Mobile Charging Trucks During Natural Disaster Evacuation: An Offline-to-Online Framework
Updated: 2026-05-16T03:35:16Z
During large-scale evacuations, concentrated electric vehicle (EV) charging demand can overload fixed charging stations (FCSs), leading to prolonged waiting time and increased risk exposure. To address this challenge, this study proposes dynamically deploying mobile charging trucks (MCTs) to complement FCSs, and develops an Adaptive Risk-aware MCT Deployment (ARMD) framework for real-time operation. It divides the MCT deployment into two problems: risk-aware allocation of MCTs among FCSs and dynamic routing of MCTs to the assigned FCSs, and solves them under an offline-to-online paradigm. The resource allocation problem is formulated as a decentralized partially observable Markov decision process, and a multi-agent proximal policy optimization (MAPPO)-based policy is developed to coordinate multiple MCTs under decentralized observations. The policy is pre-trained offline in an evacuation simulator and adaptively refined online according to current evacuation context. For routing, a spatio-temporal travel time predictor is developed to support rolling-horizon route updates. The proposed framework is evaluated in a simulated hurricane evacuation environment built using real-world data from Hillsborough County, Florida. Experiments show that ARMD consistently outperforms offline optimization, online heuristic dispatch, and rolling-horizon optimization in reducing risk exposure. For demand perturbation scenarios, ARMD reduces average risk exposure by up to 71.1%, relative to the baseline without MCTs.
In the case of fixed e-vehicle charging infrastructure or road link failures, ARMD achieves 39.3% to 60.5% reduction in average risk exposure, with its advantages becoming more pronounced as the severity of disruption increases. These results demonstrate the effectiveness and robustness of ARMD in enhancing mobile charging operations for realistic scenarios of uncertain evacuation conditions.
Published: 2026-05-16T03:35:16Z
Authors: Rui Ma, Zilin Bian, Kaan Ozbay

http://arxiv.org/abs/2605.16757v1
NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning
Updated: 2026-05-16T02:11:34Z
Multi-agent language systems are often built as hand-designed workflows, where agents are assigned semantic roles and communication protocols are specified in advance. We propose NeuroMAS, a method that first treats a multi-agent language system as a trainable and scalable neural-network-like architecture with LLM agents as nodes and intermediate textual signals as edges. In NeuroMAS, agent nodes are role-free but structure-aware: the topology only determines how information can flow in general, while reinforcement learning training determines how nodes communicate, specialize, and coordinate. This formulation shifts multi-agent design from workflow engineering toward architecture design, where depth, width, connectivity, and growth protocol become scalable sources of capability. Further, we provide a theoretical perspective showing why such modular textual computation is more parameter-efficient when tasks admit hierarchical decompositions. Experiments show that NeuroMAS improves significantly over both inference-time and trained multi-agent baselines. We further find that organizational scaling is path-dependent: larger systems can be challenging to train from scratch, but become feasible when grown progressively from smaller trained systems.
These results suggest that learned neural multi-agent systems are a promising scaling axis for LLMs.
Published: 2026-05-16T02:11:34Z
Authors: Haoran Lu, Luyang Fang, Wenxuan Zhong, Ping Ma

http://arxiv.org/abs/2605.16748v1
Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation
Updated: 2026-05-16T02:03:53Z
Recent advancements in generative video models demonstrate high visual fidelity, yet their integration into enterprise environments is restricted by temporal inconsistencies and severe brand misalignment. Current monolithic architectures struggle to enforce rigid brand constraints, frequently hallucinating unapproved visual assets. We introduce Genflow, a Compound AI System designed to enforce brand consistency in generative media production. Our architecture integrates a retrieval-based 'Brand DNA' extraction module to parameterize generation according to established corporate identity guidelines. Furthermore, we implement an Adversarial Multi-Agent Quality Control (QC) loop. Instead of a single-pass generation, this pipeline employs evaluator agents to iteratively critique generated frames against the extracted parameters, prompting generator models to refine outputs until a deterministic consensus is reached. By transitioning to a multi-stage, self-correcting pipeline, Genflow improved the yield of brand-compliant video generations from 42% to 89%, establishing a robust framework for scalable, enterprise-grade generative systems.
Published: 2026-05-16T02:03:53Z
Comments: 6 pages, 2 figures, 2 tables. Accepted to the ACM Conference on AI and Agentic Systems (CAIS '26).
Includes demo video and code repository links
Journal reference: ACM Conference on AI and Agentic Systems (CAIS '26), May 26-29, 2026, San Jose, CA, USA
Authors: Debanshu Das, Lavi Nigam, Sunil Kumar Jang Bahadur, Gopala Dhar
DOI: 10.1145/3786335.3813213

http://arxiv.org/abs/2605.13900v2
Ready from Day 1: Population-Aware Coordination for Large-Scale Constrained Multi-Agent Systems
Updated: 2026-05-16T00:36:49Z
In large-scale multi-agent systems with shared resource constraints, an upstream planner must iteratively evaluate candidate resource plans -- assessing feasibility, aggregate response, and marginal cost -- before committing to one. Lagrangian relaxation separates local decisions through a broadcast cost signal, but the planner still needs the cost-to-utilization response map to explore plan space, and this map depends on population composition that changes across planning cycles. We propose \emph{population-aware coordination interfaces}: learned primal and dual maps, conditioned on compact population summaries, that the planner queries inside its iterative loop. The primal map predicts aggregate utilization under a proposed cost trajectory; the dual map predicts the cost trajectory for a target plan. By encoding response-relevant population structure, these maps remain reliable across evolving populations without per-cycle retraining, and support coordination of large populations from compact subsamples. We additionally cast Sim2Real transfer as a backtestable procedure, enabling evaluation before deployment. In a supply-chain capacity-control case study, population-aware interfaces reduce forecast error by 16--19\% and capacity violations by 20--51\% relative to population-unaware baselines under composition shift; 20K-agent cohorts support accurate coordination of 500K-agent populations; and simulator-trained primal maps achieve 11.1\% MAPE on real observations versus 13--24\% for baselines.
Published: 2026-05-12T16:57:24Z
Comments: 30 pages, 16 figures.
Submitted to NeurIPS 2026
Authors: Angel Wang, Dominique Perrault-Joncas, Alvaro Maggiar, Carson Eisenach, Dean Foster

http://arxiv.org/abs/2603.20380v2
Herding CATs: ALARA for Agent Harness Engineering in Portable Composable Multi-Agent Teams
Updated: 2026-05-15T21:59:54Z
Industry practitioners and academic researchers regularly use multi-agent systems to accelerate their work, but the applications through which users operate these systems do not provide a simple, unified mechanism for scalably managing critical components of the agent harness. This lack of control both degrades the quality of individual human-agent interactions and reduces the capacity for practitioners to coordinate context engineering efforts. The behavioral specifications that define what agents in such systems can do remain fragmented across prose instruction files -- for which compliance cannot be guaranteed -- or framework-internal configurations, making these specifications difficult to share, version, or collaboratively maintain across teams and projects. Applying the ALARA principle from radiation safety (exposures kept as low as reasonably achievable) to context, we introduce a context-agent-tool (CAT) data layer expressed through interrelated plain-text files, allowing users to directly declare tool access for each agent and to modify the tools themselves that are used by the agents when processing. We demonstrate the capability of this CAT data layer to enable real agentic usage by using a command-line shell that loads the team and executes agent runs -- \texttt{npcsh} -- and evaluating 22 locally-hosted models from 0.6B to 35B parameters across 115 practical tasks spanning file operations, web search, multi-step scripting, tool chaining, and multi-agent delegation. We characterize which model families succeed in certain task categories and where they break down across $\sim$2500 total executions.
Published: 2026-03-20T18:00:09Z
Comments: Accepted to HAXD 2026, 8 pages, 6 figures
Authors: Christopher J.
Agostino, Nayan D'Souza

http://arxiv.org/abs/2411.14637v5
Enhancing Clinical Trial Patient Matching through Knowledge Augmentation and Reasoning with Multi-Agent
Updated: 2026-05-15T20:35:10Z
Matching patients effectively and efficiently for clinical trials is a significant challenge due to the complexity and variability of patient profiles and trial criteria. This paper introduces \textbf{Multi-Agent for Knowledge Augmentation and Reasoning (MAKAR)}, a novel multi-agent system that enhances patient-trial matching by integrating criterion augmentation with structured reasoning. MAKAR consistently improves performance by an average of 7\% across different datasets. Furthermore, it enables privacy-preserving deployment and maintains competitive performance when using smaller open-source models. Overall, MAKAR can contribute to more transparent, accurate, and privacy-conscious AI-driven patient matching.
Published: 2024-11-22T00:07:36Z
Authors: Hanwen Shi, Jin Zhang, Kunpeng Zhang

http://arxiv.org/abs/2605.16598v1
GRASP: Graph Agentic Search over Propositions for Multi-hop Question Answering
Updated: 2026-05-15T19:59:35Z
Agentic retrieval improves multi-hop question answering by giving language models autonomy to iteratively gather evidence. Recent work augments these systems with knowledge graphs for structured traversal, but this combination introduces significant cost: expensive graph construction at index time and compounding token usage at inference time. We introduce Graph Agentic Search over Propositions (GRASP), an agentic system that simultaneously optimizes for high accuracy and minimal token usage in multi-hop question answering. Rather than executing a rigid, singular query, GRASP actively coordinates its retrieval strategy by decomposing multi-hop queries into dependency-aware plans. This enables GRASP to dynamically scale the number of sub-agents according to the complexity of the problem.
Each sub-agent resolves its single-hop query by exploring a novel three-layer hierarchical graph of entities, propositions, and passages, using the entity layer for targeted traversal and the proposition layer for high-recall passage retrieval via reciprocal-rank voting. We evaluate GRASP on MuSiQue, 2WikiMultihopQA, and HotpotQA under two settings: open-corpus retrieval and extended context reasoning (LongBench). GRASP achieves the highest QA accuracy in the open retrieval setting on MuSiQue and 2Wiki while using 40-50 percent fewer tokens than IRCoT+HippoRAG2. Furthermore, GRASP leads on EM and F1 across all three datasets in the LongBench setting while using 30 percent fewer tokens than the next most accurate method. Finally, we introduce success economy (the amortized token cost per correct answer, weighted by difficulty) and advocate for efficiency-aware evaluation as a standard practice for agentic QA.
Published: 2026-05-15T19:59:35Z
Authors: Stockton Jenkins, Ramya Korlakai Vinayak, Junjie Hu

http://arxiv.org/abs/2605.09826v2
EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
Updated: 2026-05-15T19:33:36Z
Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes.
Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.
Published: 2026-05-11T00:04:19Z
Authors: Gurusha Juneja, Dylan Lu, Saaket Agashe, Parth Diwane, Edward Gunn, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Yali Du, Xin Eric Wang

http://arxiv.org/abs/2605.16522v1
A Mechanistic Model for Collective Motion from Sensorimotor Regularities
Updated: 2026-05-15T18:17:41Z
Collective behavior in animals has long been modeled through self-propelled particle models, which reproduce striking group-level phenomena through abstract interaction forces. Yet these models are fundamentally descriptive: they leave open the question of how collective behavior is actually produced. Recent empirical work makes this gap concrete: locusts do not align with neighbors; sensory and cognitive mechanisms mediate interaction instead. A mechanistic model must therefore operate at the sensorimotor level, grounded in what individual organisms can actually perceive, estimate, and physically execute. We present such a model based on a modeling framework from robotics, extended here to collective motion. Each agent perceives neighbors through bearing and apparent-size cues within a limited field of view, maintains uncertain internal state estimates, and selects actions through gradient descent on a desired social distance -- without any prescribed interaction forces. This simple model produces diverse collective behaviors including polarized motion, milling, ring formations, and subgroup fragmentation. A global sensitivity analysis shows that behavioral transitions are governed by sensorimotor parameters corresponding to measurable biological quantities: field of view geometry, sensory noise, turning agility, and memory.
Collective behavior can therefore be understood as the emergent outcome of interacting sensorimotor regularities, and differences across species as the emergent outcome of differences in embodiment and environment.
Published: 2026-05-15T18:17:41Z
Authors: Vito Mengers, Bao Duc Cao, Oliver Brock

http://arxiv.org/abs/2605.16233v1
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
Updated: 2026-05-15T17:42:49Z
Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%.
We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
Published: 2026-05-15T17:42:49Z
Authors: Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman
DOI: 10.1145/3786335.3813155

http://arxiv.org/abs/2605.16205v1
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
Updated: 2026-05-15T17:23:08Z
Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents).
We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4$\times$ worse mean return while using 1.8-2.7$\times$ more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.
Published: 2026-05-15T17:23:08Z
Authors: Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman
DOI: 10.1145/3786335.3813149

http://arxiv.org/abs/2605.16194v1
paper.json: A Coordination Convention for LLM-Agent-Actionable Papers
Updated: 2026-05-15T17:10:50Z
LLM agents routinely serve as first (and sometimes only) readers of academic papers, skimming for sub-claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub-claims that cannot be cited at sub-paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than the paper itself. We propose `paper.json`, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does-not-claim list (C2), exact per-figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, hand-written JSON alongside the PDF, is achievable in under an hour for a finished paper without touching the human-readable output.
C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: `uv run validator.py paper.json --against paper.typ` passes. Repo: https://github.com/arquicanedo/paper-json
Published: 2026-05-15T17:10:50Z
Authors: Arquimedes Canedo

http://arxiv.org/abs/2605.16144v1
MAxLM: Multi-Agent Language Model-Based Scheduling and Resource Allocation in MU-MIMO-OFDMA-Enabled Wireless Networks
Updated: 2026-05-15T16:24:40Z
Wireless networks support multi-user (MU) communication with multiple-input multiple-output (MIMO) and orthogonal frequency-division multiple access (OFDMA) technologies. In the joint MU-MIMO-OFDMA-enabled transmission mode, network throughput can be significantly increased by effectively utilizing the multi-channel resources to schedule numerous wireless users/stations (STAs) simultaneously. In this paper, we study ways to optimize the user scheduling and resource allocation (SRA) for the UL scheduled access (UL-SA) of a joint MU-MIMO-OFDMA-enabled wireless local area network (WLAN). In particular, we propose a multi-agent (MA) framework that utilizes an openly available pretrained small/medium-sized Language Model (xLM) to perform SRA for the UL-SA. To facilitate autonomous SRA using our proposed technique, we introduce the AI-assisted Wireless Systems Engineering and Research (WiSER) platform. We evaluate the performance of MAxLM-optimized SRA for network scenarios with a varying number of STAs and antenna settings on the WLAN Access Point. Numerical results confirm that our proposed technique achieves higher UL-SA throughput than the benchmark techniques.
Published: 2026-05-15T16:24:40Z
Authors: Adnan Quadri, Hongxiang Li

http://arxiv.org/abs/2605.16097v1
Multi-Agent Cooperative Transportation: Optimal and Efficient Task Allocation and Path Finding
Updated: 2026-05-15T15:52:52Z
Multi-robot systems are integral to modern logistics, but their capabilities are often limited to tasks executable by individual agents.
This paper addresses a critical gap in existing frameworks like Multi-Agent Path Finding (MAPF) and Task Allocation and Path Finding (TAPF), which lack true cooperation for transporting large items that require multiple agents. To this end, we formalise the Cooperative Transportation Task Allocation and Path Finding (CT-TAPF) problem, which integrates team formation, task assignment, and collision-free pathfinding. We present an optimal solver, Cooperative Transportation Task Conflict-Based Search (CT-TCBS), which features a novel Incremental Expansion strategy to tackle the combinatorial explosion inherent in team formation. Recognising the computational cost of optimality, we also develop a family of sub-optimal solvers that employ a global, task-centric perspective, selecting the next task to assign based on a global difficulty metric (Best Task or Worst Task). Our comprehensive empirical evaluation demonstrates three key findings: (1) the incremental expansion strategy significantly outperforms the naive combinatorial approach by successfully pruning the dominant task-allocation search space; (2) we identify a task-conflict expansion dilemma, where sophisticated conflict resolvers effective for large-agent pathfinding subproblems can be detrimental in the integrated CT-TAPF setting; and (3) our proposed sub-optimal solvers establish a new, more efficient frontier on the solution quality-runtime spectrum compared to "nn-" agent-centric baselines. This work provides a foundational framework and a set of effective algorithms for a new, practical class of cooperative multi-agent problems.
Published: 2026-05-15T15:52:52Z
Authors: Ning Zhou, Nikolai W. F. Bode, Edmund R. Hunt