http://arxiv.org/abs/2311.05181v6 Energy-efficient flocking with nonlinear navigational feedback 2026-05-10T23:34:39Z

Modeling collective motion in multi-agent systems has gained significant attention. Of particular interest are sufficient conditions for flocking dynamics. We present a generalization of the multi-agent model of Olfati-Saber with nonlinear navigational feedback forces. Unlike the original model, ours is not generally dissipative and lacks an obvious Lyapunov function. We address this by proposing a method to prove the existence of an attractor without relying on LaSalle's principle. Other contributions are as follows. We prove that, under mild conditions, agents' velocities approach the center of mass velocity exponentially, with the distance between the center of mass and the virtual leader being bounded. In the dissipative case, we show existence of a broad class of nonlinear control forces for which the attractor does not contain periodic trajectories, which cannot be ruled out by LaSalle's principle. Finally, we conduct a computational investigation of the problem of reducing propulsion energy consumption by selecting appropriate navigational feedback forces.

2023-11-09T07:34:29Z Oleksandr Dykhovychnyi Alexander Panchenko 10.1007/s11071-024-10527-9 http://arxiv.org/abs/2605.09768v1 SAGE: Scalable Agentic Grounded Evaluation for Crop Disease Diagnosis 2026-05-10T21:29:32Z

Plant disease diagnosis is critical for food security, yet training disease-recognition models that generalize across crops, pathogens, and field conditions remains challenging because labeled disease images are far less abundant and standardized than data for other biotic stresses such as insects or weeds. Frontier vision-language models offer new opportunities through improved visual reasoning, but they still struggle with fine-grained disease identification due to the lack of structured, crop-specific symptom knowledge. To address this gap, we curate the largest plant disease image-symptom dataset to date, covering 335 crops, 1,251 disease classes, and approximately 839K images, designed to support training-free, agentic disease prediction. A scalable automated pipeline generates source-grounded symptom descriptions in which each claim is linked to a verbatim web quote; domain experts validate sampled crops and reconcile disease-name variants across sources. As a baseline, we introduce an autonomous visual reasoning agent that identifies anatomical context, narrows candidate diseases using symptom knowledge, sequentially compares reference images, and produces a fully explainable reasoning trace. Incorporating symptom knowledge improves accuracy by 16.2 percentage points on average at the full reference budget, with consistent gains across all four evaluation crops. Because the framework only requires crop-specific reference images and symptom knowledge, it can be extended to new crops without retraining, while the agentic baseline can directly benefit from future improvements in foundation model capabilities. Dataset and code are available at https://sage-dataset.github.io/.

2026-05-10T21:29:32Z Muhammad Arbab Arshad Tirtho Roy Yanben Shen Dinakaran Elango Shivani Chiranjeevi Asheesh K. Singh Baskar Ganapathysubramanian Chinmay Hegde Arti Singh Soumik Sarkar http://arxiv.org/abs/2605.11030v1 An Executable Benchmarking Suite for Tool-Using Agents 2026-05-10T21:24:53Z

Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through common workload adapters, task manifests, event schemas, replay/freeze policy, declared drivers, and reporting pipelines. In the canonical release, the gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows while preserving non-admitted artifacts for audit and onboarding. The admitted evidence records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, and provenance under one auditable contract. The gate is decision-relevant rather than merely clerical: in a separate WebArena Verified controller study, clean-baseline and medium live-stressed evaluation select different fixed controller variants under the same workload and admission contract. The release is scoped as a benchmarking suite and admitted evidence, not a new agent policy, model leaderboard, backend comparison, or autonomous SWE-bench solver.

2026-05-10T21:24:53Z 20 pages, 2 figures, 20 tables, including appendices Zhiqing Zhong Zhijing Ye Jiamin Wang Xiaodong Yu http://arxiv.org/abs/2605.09734v1 Trajectory Supervision for Continual Tool-Use Learning in LLMs 2026-05-10T20:09:41Z

Most language-model training data shows final artifacts, not the process that produced them. We study a tractable version of this question in tool use: when a model learns a stream of new API domains, does keeping tool-use trajectories help compared with stripping the intermediate API trace? We fine-tune Llama 3.1 8B Instruct with QLoRA on API-Bank using four sequential domain blocks. Condition A strips previous API request/response lines from the prompt and trains the model to predict the next API call. Condition B keeps the trajectory context. In a single-seed pilot, full held-out generation evaluation shows that Condition B reaches 56.9% final exact full-call accuracy compared with 39.2% for Condition A. B also improves final API-name accuracy by 7.7 points. However, B uses 25.1% more training tokens, the run uses one seed, and the task is next-call prediction rather than full dialogue success.
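The difference between the two conditions can be illustrated with a minimal sketch of prompt construction. The line prefixes `API-Request:` and `API-Response:` and the function name `build_prompt` are assumptions made for illustration; the abstract does not specify how the authors mark or strip API lines.

```python
# Hypothetical markers for API trace lines; the real data format may differ.
API_PREFIXES = ("API-Request:", "API-Response:")


def build_prompt(history_lines, keep_trajectory):
    """Build the prompt context for next-API-call prediction.

    keep_trajectory=False mimics Condition A (API trace lines stripped);
    keep_trajectory=True mimics Condition B (full trajectory kept).
    """
    if keep_trajectory:
        kept = list(history_lines)
    else:
        # Drop every line that starts with one of the assumed API markers.
        kept = [
            line for line in history_lines
            if not line.lstrip().startswith(API_PREFIXES)
        ]
    return "\n".join(kept)


history = [
    "User: book a room",
    "API-Request: [CheckRoom(date='today')]",
    "API-Response: {'free': True}",
    "AI: A room is free.",
]
prompt_a = build_prompt(history, keep_trajectory=False)
prompt_b = build_prompt(history, keep_trajectory=True)
```

Under this sketch, the Condition B prompt is longer than the Condition A prompt for the same dialogue, which is in line with the reported higher training-token count for B.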

2026-05-10T20:09:41Z Vishnu Vardhan Reddy Sagnik Chatterjee Soumik Bhatta http://arxiv.org/abs/2605.09675v1 CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents 2026-05-10T17:45:01Z

Clinical reasoning agents based on large language models (LLMs) aim to automate tasks such as intensive care unit (ICU) monitoring and patient state tracking from electronic health records (EHRs). Existing systems typically rely on manually curated clinical tools or skills for concepts such as sepsis detection and organ failure assessment. However, maintaining these tool libraries requires substantial expert effort, while zero-shot querying or code generation often produces inefficient and unreliable reasoning chains, especially under institution-specific clinical policies. We introduce CodeClinic, a benchmark built on MIMIC-IV for evaluating whether LLM agents can synthesize and compose reusable clinical skills instead of relying on fixed toolboxes. The benchmark contains two complementary tasks: longitudinal ICU surveillance and compositional information seeking. The longitudinal setting simulates monitoring patient trajectories with structured decisions every four hours across 25 findings and eight clinical families, while the compositional setting spans 63k instances across 259 tasks in nine domains and is stratified by compositional dependency depth to evaluate increasingly complex multi-step reasoning. We further propose an offline autoformalization pipeline that converts natural-language clinical guidelines into reusable and verified Python skill libraries through iterative LLM refinement. Compared with zero-shot code generation, the resulting libraries improve consistency while reducing per-query token usage by up to 40%.

2026-05-10T17:45:01Z Timothy Ossowski Xinchi Liu Danyal Maqbool Vaibhav Dhanuka Sheng Zhang Hoifung Poon Majid Afshar Tyler Bradshaw Junjie Hu http://arxiv.org/abs/2604.16838v2 enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways 2026-05-10T16:22:59Z

We present enclawed, a hard-fork hardening framework built on the OpenClaw AI assistant gateway. enclawed targets deployments that need attestable peer trust, deny-by-default external connectivity, signed-module loading, and a tamper-evident audit trail -- typically regulated industries (financial services, healthcare, defense, government). The framework ships in two flavors: an open flavor preserving OpenClaw compatibility while emitting audit, classification, and data-loss-prevention (DLP) signals, and an enclaved flavor activating strict allowlists, FIPS cryptographic-module assertion, mandatory manifest signature verification, and high-assurance peer attestation for the Model Context Protocol. The classification ladder is data-driven: deployers pick from five built-in presets or supply their own JSON. We ship a 356-case test suite (261 unit + 95 adversarial pen-tests) covering tamper detection, signature forgery, egress bypass, audit-log truncation, trust-root mutation, DLP evasion, prompt injection, code injection, and biconditional admission for net-capable extensions; real-time human-in-the-loop control; a memory-bounded transaction buffer with rollback; strict-mode TypeScript typecheck; and a CI workflow. The biconditional extension-admission gate extends the skill trust schema to non-skill extensions. The four-level verification lattice is now closed at the top: four skill-formal-* primitives plus a CLI produce a signed proof-carrying bundle the runtime re-checks at load, raising a skill from tested to formal via static effect-containment, refinement-typed dispatch, and bounded model checking. enclawed is a hardening framework, not an accredited certification; hardware, validated crypto, facilities, and assessor sign-off remain the deployer's responsibility.

2026-04-18T05:10:11Z Alfredo Metere http://arxiv.org/abs/2605.09610v1 SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications 2026-05-10T15:47:46Z

We introduce SmartEval, a benchmark for systematically evaluating the quality of Solidity smart contracts generated by large language models (LLMs) from natural language specifications. SmartEval provides a corpus of 9,000 generated contracts paired with expert-written ground-truth implementations drawn from the FSMSCG dataset, a five-dimensional evaluation rubric covering functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality, and a reproducible generation-and-evaluation pipeline. To validate the benchmark's reliability, we conduct three independent empirical studies: a five-condition ablation study (N=300 per condition) isolating the contribution of each pipeline component, a human expert evaluation by three Columbia University PhD researchers confirming automated scores align with expert judgment to within 0.34 points, and external security analysis via the Slither static analyzer confirming 79.4% agreement between the LLM auditor and a non-LLM rule-based tool. Systematic analysis of 9,000 generated contracts reveals characteristic failure modes (logic omissions at 35.3%, state transition errors at 23.4%, and complexity-driven degradation) and quantifies a +8.29 composite-score advantage of generated contracts over ground-truth implementations, attributable to LLMs' literal specification-following behavior. SmartEval establishes a reproducible, validated foundation for empirical research on LLM smart contract synthesis quality, with all data, evaluation code, and generated contracts publicly released.

2026-05-10T15:47:46Z Abhinav Goel Agostino Capponi Alfio Gliozzo Chaitya Shah http://arxiv.org/abs/2604.26805v2 Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations 2026-05-10T15:46:25Z

Operating and maintaining (O&M) large-scale online engine systems (e.g., search, recommendation, and advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. Despite the inherent suitability of LLM-based agents for such operational scenarios, the critical bottleneck impeding their practical deployment lies not in reasoning but in orchestration capability: specifically, the precise selection of relevant data (encompassing metrics, logs, and change events) and applicable knowledge (including handbook-defined rules and empirically derived practitioner experience) tailored to each individual operational event. Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. Here we present Bian Que, an agentic operating framework with three contributions: (i) The unified operational paradigm, which abstracts routine daily O&M actions into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) The flexible Skill Arrangement, in which each predefined Skill explicitly defines the requisite data and operational knowledge for a specific context. Such Skills can be automatically generated and updated by LLM agents, and can also be iteratively optimized by on-call engineers via natural language instructions; (iii) The unified self-evolving mechanism, where each correction signal enables two parallel evolutionary pathways: distilling event memory into knowledge, and targeted refinement of Skills. Deployed on the e-commerce search engine of KuaiShou, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, cuts mean time to resolution by over 50%, and attains a 99.0% pass rate on offline evaluations. Code is available at https://github.com/benchen4395/BianQue_Assistant.

2026-04-29T15:35:01Z HomePage: https://benchen4395.github.io Bochao Liu Zhipeng Qian Yang Zhao Xinyuan Jiang Zihan Liang Yufei Ma Junpeng Zhuang Ben Chen Shuo Yang Hongen Wan Yao Wu Chenyi Lei Xiao Liang http://arxiv.org/abs/2605.01805v2 MAGIC: Multi-Step Advantage-Gated Causal Influence for Multi-agent Reinforcement Learning 2026-05-10T14:23:04Z

A key challenge in multi-agent reinforcement learning (MARL) lies in designing learning signals that effectively promote coordination among agents. Designing such signals requires estimating how one agent's current action affects its teammates over future interaction steps. To address this, we introduce Multi-step Advantage-Gated Interventional Causal MARL (MAGIC), a framework that estimates multi-step action effects between agents and selectively converts them into intrinsic rewards. MAGIC uses counterfactual action interventions to compare teammate futures under factual and counterfactual branches, and introduces a gate based on advantage to direct exploration toward beneficial behaviors aligned with the task goal. Experiments on Multi-Agent Particle Environments (MPE) and StarCraft micromanagement benchmarks (SMAC and SMACv2) show that MAGIC consistently outperforms leading prior methods, with average relative final performance improvements of 26.9% and 10.1%, respectively.
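The idea of gating an influence estimate by advantage can be sketched as follows. This is a minimal illustration, not the estimator used in MAGIC: the mean-absolute-difference influence measure, the hard gate at zero advantage, and the names `multi_step_influence` and `gated_intrinsic_reward` are all assumptions.

```python
def multi_step_influence(factual, counterfactual):
    """Mean absolute gap between a teammate's predicted future under the
    factual action and under a counterfactual action, over several steps.
    (Assumed influence measure, chosen only for illustration.)"""
    if len(factual) != len(counterfactual):
        raise ValueError("both branches must cover the same number of steps")
    if not factual:
        return 0.0
    total = sum(abs(f - c) for f, c in zip(factual, counterfactual))
    return total / len(factual)


def gated_intrinsic_reward(influence, advantage, scale=1.0):
    """Pass the influence through as intrinsic reward only when the
    action's advantage is positive (assumed hard gate)."""
    gate = 1.0 if advantage > 0 else 0.0
    return scale * gate * influence


# Scalar summaries of a teammate's next three steps under each branch.
influence = multi_step_influence([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
rewarded = gated_intrinsic_reward(influence, advantage=0.5)
suppressed = gated_intrinsic_reward(influence, advantage=-0.5)
```

The gate keeps an agent from being rewarded for influencing teammates through actions that are estimated to hurt the task objective.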

2026-05-03T10:05:48Z Haohan Yu Jinmiao Cong Shengzhi Wang Lu Wang Chanjuan Liu http://arxiv.org/abs/2605.09522v1 Emergent Communication for Co-constructed Emotion Between Embodied Agents via Collective Predictive Coding 2026-05-10T13:14:50Z

According to the theory of constructed emotion, the brain actively forms emotion categories by integrating multimodal bodily signals, and constructs emotional experiences by using these categories to predict and interpret sensory inputs. While research has advanced in modeling individual emotion construction, the social process of co-construction (how a shared understanding of emotions emerges between individuals) remains computationally underexplored. This study investigates this process by modeling emergent communication between two embodied agents using the Metropolis-Hastings Naming Game (MHNG), grounded in the Collective Predictive Coding (CPC) framework. Our experiments, using visual, auditory, and simulated interoceptive inputs, yield two main findings. First, MHNG-based communication significantly improves the alignment, clarity, and inter-agent agreement of the learned emotion categories compared to non-communicative and non-selective baselines, with the alignment effect concentrated at the symbolic layer rather than the perceptual latent representation. Second, even when the two agents have systematically divergent interoceptive dynamics, communication still produces robust categorical alignment, with distinct, category-specific reshaping patterns of each agent's emotion categories, consistent with the constructed-emotion view that interoceptive heterogeneity is constitutive of, rather than an obstacle to, shared emotional meaning. These findings provide computational support for the co-constructionist view of emotion and extend the CPC framework from physical to socially-grounded domains.

2026-05-10T13:14:50Z 13 pages, Zehang Zhang Nguyen Le Hoang Tadahiro Taniguchi Takato Horii http://arxiv.org/abs/2602.06733v2 Pairwise is Not Enough: Hypergraph Neural Networks for Multi-Agent Pathfinding 2026-05-10T12:54:52Z

Multi-Agent Path Finding (MAPF) is a representative multi-agent coordination problem, where multiple agents are required to navigate to their respective goals without collisions. Solving MAPF optimally is known to be NP-hard, leading to the adoption of learning-based approaches to alleviate the online computational burden. Prevailing approaches, such as Graph Neural Networks (GNNs), are typically constrained to pairwise message passing between agents. However, this limitation leads to suboptimal behaviours and critical issues, such as attention dilution, particularly in dense environments where group (i.e., beyond just two agents) coordination is most critical. Despite the importance of such higher-order interactions, existing approaches have not been able to fully explore them. To address this representational bottleneck, we introduce HMAGAT (Hypergraph Multi-Agent Attention Network), a novel architecture that leverages attentional mechanisms over directed hypergraphs to explicitly capture group dynamics. Empirically, HMAGAT establishes a new state-of-the-art among learning-based MAPF solvers: e.g., despite having just 1M parameters and being trained on 100× less data, it outperforms the current SoTA 85M parameter model. Through detailed analysis of HMAGAT's attention values, we demonstrate how hypergraph representations mitigate the attention dilution inherent in GNNs and capture complex interactions where pairwise methods fail. Our results illustrate that appropriate inductive biases are often more critical than the training data size or sheer parameter count for multi-agent problems.

2026-02-06T14:28:54Z Published at ICLR 2026 Rishabh Jain Keisuke Okumura Michael Amir Pietro Lio Amanda Prorok http://arxiv.org/abs/2605.09344v1 PECMAN: Perception-enabled Collaborative Multi-Agent Navigation in Unknown Environments 2026-05-10T05:44:48Z

Most path planners assume fully known, static environments, assumptions that fail when robots navigate in dynamic and partially observable environments. SMART-3D addresses these issues by real-time replanning, where it morphs the underlying RRT* tree whenever new obstacles or structures are discovered in the environment. Instead of rebuilding the tree entirely from scratch, SMART-3D prunes invalid nodes and edges and subsequently repairs the disjoint subtrees at hot-nodes to find a new path, thus providing high computational efficiency for real-time adaptability. We extend SMART-3D to perception-enabled collaborative multi-agent navigation (PECMAN) in unknown environments. PECMAN is built upon distributed tree morphing and shared perception strategies, where each agent reacts to environmental changes and morphs its respective tree to replan its path, while simultaneously broadcasting newly discovered structures to other agents, thus enabling them to proactively replan even in areas that have not yet been explored by them. This approach reduces redundant reactions and unnecessary replannings of the agents due to improved situational awareness. The performance of PECMAN was evaluated by 28,000 multi-agent simulations on seven 2D scenarios with different case studies. The results show that PECMAN achieves up to 52% reduction in the team-completion time, while maintaining near 100% success rates. Finally, PECMAN was tested by real experiments on two autonomous robots in a building environment.
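The pruning step described above, removing invalid nodes and leaving disjoint subtrees for later repair, can be sketched on a tree stored as a child-to-parent map. This is a simplified illustration under stated assumptions: it handles only invalid nodes (not invalid edges), omits the repair at hot-nodes, and the name `prune_tree` and the data layout are hypothetical rather than taken from SMART-3D.

```python
def prune_tree(parent, invalid):
    """Remove invalid nodes from a tree given as {node: parent_or_None}.

    Returns the pruned map plus the roots of the subtrees that were cut
    off from the original tree and would need to be reconnected.
    """
    pruned = {}
    orphan_roots = []
    for node, par in parent.items():
        if node in invalid:
            continue  # the node lies inside a newly discovered obstacle
        if par is not None and par in invalid:
            # The parent was removed, so this node now heads a disjoint subtree.
            pruned[node] = None
            orphan_roots.append(node)
        else:
            pruned[node] = par
    return pruned, orphan_roots


# Tree: r -> a -> {b -> c, d}; node "a" becomes blocked by a new obstacle.
tree = {"r": None, "a": "r", "b": "a", "c": "b", "d": "a"}
pruned, orphans = prune_tree(tree, invalid={"a"})
```

Keeping the cut-off subtrees instead of discarding them is what lets a replanner reuse earlier work rather than rebuild the whole tree.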

2026-05-10T05:44:48Z Tianchonghui Fang Shaunak Roy Shalabh Gupta http://arxiv.org/abs/2605.09342v1 A Cross-Layered Multi-Drone Coordination for Medical Supply Delivery during Disaster Response Management 2026-05-10T05:43:51Z

Autonomous drone fleets have immense potential in medical supply delivery during disaster incident response. However, coordinating multiple drones in such settings introduces compounding challenges: dynamic environmental hazards such as wind, obstacles, and intermittent network connectivity, constrained energy budgets, and the need to serve patient locations fairly under deadlines and triage-based priority while optimizing schedule utilization. In this paper, we present CEDA, a novel CTDE Deep Q-Network algorithm for cooperative multi-drone medical delivery, designed to jointly optimize triage-priority-aware routing, multi-agent coordination, and energy-efficient navigation under dynamic uncertainty. CEDA introduces a Priority-Preserving Fair Scheduling strategy, in which a structured reward function encodes both triage weights and complementary fairness mechanisms ensuring no patient class is starved of service. We evaluate CEDA in a simulated grid environment featuring dynamic hazard zones, stochastic action failures, and dynamically spawning patients across three triage priority levels, as well as in a PX4 SITL validation using two X500 quadrotors controlled via MAVSDK in offboard position mode. Simulation results demonstrate that CEDA achieves a delivery completion rate above 85%, reduces obstacle collisions by over 90% across training, and serves an average of 6 patients per episode with a triage efficiency of 0.82. CEDA preserves clinical priority ordering (Critical patients are served first) while achieving near-zero mortality across lower-triage classes, confirming that priority-weighted routing does not condemn Stable or Urgent patients to neglect. PX4 SITL validation further demonstrates that the learned policy remains executable and triage-coherent under practical communication constraints and realistic multi-drone coordination in disaster response settings.

2026-05-10T05:43:51Z 18 pages, 14 figures, 3 tables Aneesh Calyam Subrahmanya Chandra Bhamidipati Zack Murry Sharan Srinivas http://arxiv.org/abs/2605.04741v2 Hierarchical Multiagent Reinforcement Learning for Multi-Group Tax Game 2026-05-10T04:42:30Z

Reinforcement learning has increasingly been applied to economic decision-making, including taxation, public spending, and labor supply. However, existing RL-based economic models typically consider only a single government-household group, overlooking strategic interactions among competing governments. To address this limitation, we formulate taxation as a hierarchical multi-group game. Within each group, the government and households form a leader-follower game, while governments compete across groups through strategic fiscal policies. This coupled structure is difficult to solve using standard multi-agent reinforcement learning (MARL) methods. We therefore propose a bilevel MARL framework with Curriculum Learning and a Closed-Loop Sequential Update mechanism to improve training stability and convergence. We instantiate the framework in a taxation simulation environment grounded in classical economic models, supporting the evaluation of taxation policies under inter-group competition. Experiments show that the proposed method learns stable and sustainable tax policies. Compared with a two-group baseline without the proposed mechanisms, our approach avoids premature game collapse, extends the effective game duration by 60.92%, and reduces GDP disparities among governments by 44.12%.

2026-05-06T10:37:41Z Honglei Guo Yuhan Zhao Yexin Li http://arxiv.org/abs/2606.14715v1 MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions 2026-05-10T04:03:47Z

LLM agents are increasingly used to simulate real-world interactions, but it remains unclear whether simulated behaviors preserve the content patterns and interaction dynamics of real human behaviors. Existing evaluations remain fragmented, which makes it difficult to compare systems or measure progress. In this paper, we focus on Reddit discussions as a concrete first step toward evaluating real-world social simulation. Reddit threads provide public, topic-grounded, multi-party interactions where people share experiences, debate, seek advice, express emotion, and collectively respond to products, events, and social issues. These discussions offer an observable window into broader social behavior, making them a useful setting for testing whether LLM agents can reproduce not only fluent text, but also the distributional patterns and interaction dynamics of real online communities. We introduce MiroBench, a benchmark for Reddit discussion simulation built from 4,292 real Reddit threads. MiroBench uses statistical tests to compare generated and real discussions across four major aspects: repetition and semantic uniformity, narrative content, toxicity and aggression, and structural complexity. Experiments across five domains and five models show that current simulators remain distributionally mismatched with real Reddit threads, while a lightweight prompt-based improvement procedure provides only limited gains. MiroBench offers a concrete benchmark for measuring, diagnosing, and improving realism in LLM-based social simulation.

2026-05-10T04:03:47Z Yaoning Yu Ye Yu Haojing Luo Haohan Wang