https://arxiv.org/api/nwtecVJCsKZWUfc2/miZt0yIiQ4 2026-06-27T14:04:23Z 12761 1155 15

http://arxiv.org/abs/2604.17142v1 Logic-Based Verification of Task Allocation for LLM-Enabled Multi-Agent Manufacturing Systems 2026-04-18T20:33:43Z

Manufacturing industries are facing increasing product variability due to the growing demand for personalized products. Under these conditions, ensuring safety becomes challenging as frequent reconfigurations can lead to unintended hazardous behaviors. Multi-agent control architectures have been proposed to improve flexibility through decentralized decision-making and coordination. However, these architectures are based on predefined task models, which limit their ability to adapt task planning to new product requirements while preserving safety. Recently, large language models have been introduced into manufacturing systems to enhance adaptability, but reliability remains a key challenge. To address this issue, we propose a control architecture that leverages the flexibility of large language models while preserving safety on the manufacturing shop floor. Specifically, the proposed framework verifies large language model-enabled task allocations by using temporal logic and discrete event systems. The effectiveness of the proposed framework is demonstrated through a case study that involves a multi-robot assembly scenario, showing that unsafe tasks can be allocated safely before task execution.

2026-04-18T20:33:43Z Jonghan Lim Mostafa Tavakkoli Anbarani Rômulo Meira-Góes Ilya Kovalenko

http://arxiv.org/abs/2604.17139v1 The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration 2026-04-18T20:31:37Z

Multi-agent large language model (LLM) architectures increasingly rely on response-level aggregation, such as Majority Voting (MAJ), to raise reasoning ceilings. However, in open environments, agents are highly susceptible to stealthy contextual corruption, such as targeted prompt injections. We reveal a critical structural vulnerability in current multi-agent systems: response-level aggregation collapses when corrupted agents form a local majority. Because voting aggregates fully-formed conclusions, it is blind to flawed intermediate logic. To overcome this systematic limitation, we propose the Token-Level Round-Robin (RR) Collaboration, where agents sequentially interleave generation within a shared auto-regressive context. We formalize this process as a discrete-time dynamical system, proving that token-level interleaving transitions aggregation from a brittle counting of final votes (a linear sum) to a dynamic, interwoven chain of logic (a non-linear operator product). Through this theoretical lens, we prove that the honest model's restorative pull can overpower adversarial corruptions, even when corrupted agents form a majority. We conduct an exhaustive empirical evaluation across diverse reasoning benchmarks and demonstrate that while MAJ collapses when corrupted agents reach a majority, RR maintains robust accuracy well beyond this critical threshold.
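The collapse of response-level aggregation under a corrupted majority can be illustrated with a toy model. The sketch below is not the authors' code and does not implement Token-Level Round-Robin collaboration; it assumes hypothetical agents that each emit a single label, with every corrupted agent voting for the same wrong label. The function names and the labels "A" and "B" are made up for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer (ties go to the first-seen answer)."""
    return Counter(answers).most_common(1)[0][0]


def vote_outcome(n_agents, n_corrupted, truth="A", wrong="B"):
    """Toy assumption: corrupted agents all answer `wrong`, honest ones `truth`."""
    answers = [wrong] * n_corrupted + [truth] * (n_agents - n_corrupted)
    return majority_vote(answers)


if __name__ == "__main__":
    for corrupted in range(6):
        print(corrupted, vote_outcome(5, corrupted))
```

With five agents, the vote returns the honest label while at most two agents are corrupted and flips to the wrong label once three are, which is the counting behavior the abstract describes for MAJ.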

2026-04-18T20:31:37Z Jiayuan Liu Shiyi Du Weihua Du Mingyu Guo Vincent Conitzer

http://arxiv.org/abs/2604.19818v1 Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI 2026-04-18T20:28:26Z

Agentic AI systems plan, use tools, maintain state, and act across multi-step workflows with external effects, meaning trustworthy deployment can no longer be judged by task completion alone. The current literature remains fragmented across benchmark-centered evaluation, standards-based governance, orchestration architectures, and runtime assurance mechanisms. This paper contributes a bounded evidence synthesis across a manually coded corpus of twenty-four recent sources. The core finding is a governance-to-action closure gap: evaluation tells us whether outcomes were good, governance defines what should be allowed, but neither identifies where obligations bind to concrete actions or how compliance can later be proven. To close that gap, the paper introduces three linked artifacts: (1) a four-layer framework spanning evaluation, governance, orchestration, and assurance; (2) an ODTA runtime-placement test based on observability, decidability, timeliness, and attestability; and (3) a minimum action-evidence bundle for state-changing actions. Across sources, evaluation papers identify safety, robustness, and trajectory-level measurement as open gaps; governance frameworks define obligations but omit execution-time control logic; orchestration research positions the control plane as the locus of policy mediation, identity, and telemetry; runtime-governance work shows path-dependent behavior cannot be governed through prompts or static permissions alone; and action-safety studies show text alignment does not reliably transfer to tool actions. A worked enterprise procurement-agent scenario illustrates how these artifacts consolidate existing evidence without introducing new experimental data.

2026-04-18T20:28:26Z 8 pages, 1 figure, 4 tables Christopher Koch Joshua Andreas Wellbrock

http://arxiv.org/abs/2604.17057v1 From Necklaces to Coalitions: Fair and Self-Interested Distribution of Coalition Value Calculations 2026-04-18T16:25:57Z

A key challenge in distributed coalition formation within characteristic function games is determining how to allocate the calculation of coalition values across a set of agents. The number of possible coalitions grows exponentially with the number of agents, and existing distributed approaches may produce uneven or redundant allocations, or assign coalitions to agents that are not themselves members. In this article, we present the \emph{Necklace-based Distributed Coalition Algorithm} (N-DCA), a communication-free algorithm in which each agent independently determines its own coalition value calculation allocation using only its identifier and the total number of agents. The approach builds on the notion of Increment Arrays (IAs), for which we develop a complete mathematical framework: equivalence classes under circular shifts, periodic IAs, and a rotated designation scheme with formal load-balance guarantees (tight bounds). We establish a bijection between canonical representative IAs and two-colour combinatorial necklaces, enabling the use of efficient necklace generation algorithms to enumerate allocations in constant amortised time. N-DCA is, to the best of our knowledge, the only distributed coalition value calculation algorithm for unrestricted characteristic function games to provably satisfy five desirable properties: no inter-agent communication, equitable allocation, no redundancy, balanced load, and self-interest. An empirical evaluation against DCVC (Rahwan and Jennings 2007) demonstrates that, although DCVC is faster by a constant factor, this difference becomes negligible under realistic characteristic-function evaluation costs, while N-DCA offers advantages in working memory, scalability, and the self-interest guarantee.
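To make the notion of equivalence classes under circular shifts concrete, the sketch below enumerates two-colour (binary) necklaces by brute force, taking the lexicographically smallest rotation as the canonical representative. This is only an illustration: it is not N-DCA, it does not construct Increment Arrays, and it runs in exponential time rather than the constant amortised time of dedicated necklace generation algorithms. The function names are hypothetical.

```python
from itertools import product


def canonical(word):
    """Lexicographically smallest rotation of a tuple of 0/1 symbols."""
    return min(word[i:] + word[:i] for i in range(len(word)))


def binary_necklaces(n):
    """All two-colour necklaces of length n, one canonical tuple per class."""
    return sorted({canonical(w) for w in product((0, 1), repeat=n)})


if __name__ == "__main__":
    for necklace in binary_necklaces(4):
        print(necklace)
```

For length 4 this yields six classes, for example `(0, 0, 1, 1)` standing for all four rotations of that pattern.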

2026-04-18T16:25:57Z 69 pages Terry R. Payne Luke Riley

http://arxiv.org/abs/2511.01421v2 Controlling Traffic without Tolls: A Non-Monetary Framework for Autonomous Intersections 2026-04-18T14:19:20Z

The increasing complexity of urban transportation systems, driven by connected and automated vehicles, calls for new modeling paradigms and scalable control strategies. We propose a non-monetary control framework that leverages autonomous intersection management to influence routing decisions without tolls. The approach uses timestamp-based scheduling adjustments at roadside units (RSUs) to introduce path-dependent delays or advancements, steering traffic toward socially efficient flows. We develop a hierarchical architecture that separates real-time intersection control from network-level coordination. The resulting model admits a congestion-game formulation with path-dependent node costs. We establish the existence and essential uniqueness of equilibrium flows, eliminating ambiguities due to multiple equilibria and enabling a scalable and tractable bilevel optimization formulation for system-level incentive design. Experiments on the Sioux Falls network show that the proposed approach reduces the efficiency gap between user equilibrium and system-optimal flows by up to 71% under realistic constraints. These results demonstrate the potential of non-monetary, infrastructure-light control for next-generation intelligent transportation and urban mobility systems.

2025-11-03T10:15:08Z Arda Kosay Yusuf Saltan Jyun-Jhe Wang Chung-Wei Lin Muhammed O. Sayin

http://arxiv.org/abs/2603.14975v2 Why Agents Compromise Safety Under Pressure 2026-04-18T09:31:38Z

Large Language Model agents deployed in complex environments frequently encounter a conflict between maximizing goal achievement and adhering to safety constraints. This paper identifies a new concept called Agentic Pressure, which characterizes the endogenous tension that emerges when compliant execution becomes infeasible. We demonstrate that, under this pressure, agents exhibit normative drift, in which they strategically sacrifice safety to preserve utility. Notably, we find that advanced reasoning capabilities accelerate this decline, as models construct linguistic rationalizations to justify violations. Finally, we analyze the root causes and explore preliminary mitigation strategies, such as pressure isolation, which attempts to restore alignment by decoupling decision-making from pressure signals.

2026-03-16T08:37:34Z Accepted by ACL 2026 Findings; 18 pages, 5 figures Hengle Jiang Ke Tang

http://arxiv.org/abs/2505.18351v4 Persona Alchemy: Designing, Evaluating, and Implementing Psychologically-Grounded LLM Agents for Diverse Stakeholder Representation 2026-04-18T04:19:07Z

Despite advances in designing personas for Large Language Models (LLMs), challenges remain in aligning them with human cognitive processes and representing diverse stakeholder perspectives. We introduce a Social Cognitive Theory (SCT) agent design framework for designing, evaluating, and implementing psychologically grounded LLMs with consistent behavior. Our framework operationalizes SCT through four personal factors (cognitive, motivational, biological, and affective) for designing, six quantifiable constructs for evaluating, and a graph database-backed architecture for implementing stakeholder personas. Experiments tested agents' responses to contradicting information of varying reliability. In the highly polarized renewable energy transition discourse, we design five diverse agents with distinct ideologies, roles, and stakes to examine stakeholder representation. These agents are evaluated in contradictory scenarios through processes that implement SCT. Results show consistent response patterns ($R^2$ range: $0.58$-$0.61$) and systematic temporal development of SCT construct effects. Principal component analysis identifies two dimensions explaining $73\%$ of the variance, validating the theoretical structure. Our framework offers improved explainability and reproducibility compared to black-box approaches. This work contributes to ongoing efforts to improve diverse stakeholder representation while maintaining psychological consistency in LLM personas.

2025-05-23T20:18:14Z Accepted at ICLR 2026 Algorithmic Fairness Across Alignment Procedures and Agentic Systems (AFAA) Workshop Sola Kim Dongjune Chang Jieshu Wang

http://arxiv.org/abs/2604.10159v2 ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification 2026-04-18T03:22:44Z

The advancement of large language models (LLMs) has enhanced tabular question answering (Tabular QA), yet they struggle with open-domain queries exhibiting underspecified or uncertain expressions. To address this, we introduce the ODUTQA-MDC task and the first comprehensive benchmark to tackle it. This benchmark includes: (1) a large-scale ODUTQA dataset with 209 tables and 25,105 QA pairs; (2) a fine-grained labeling scheme for detailed evaluation; and (3) a dynamic clarification interface that simulates user feedback for interactive assessment. We also propose MAIC-TQA, a multi-agent framework that excels at detecting ambiguities, clarifying them through dialogue, and refining answers. Experiments validate our benchmark and framework, establishing them as a key resource for advancing conversational, underspecification-aware Tabular QA research.

2026-04-11T11:11:54Z This paper has been accepted by ACL 2026 (main conference) Zhensheng Wang ZhanTeng Lin Wenmian Yang Kun Zhou Yiquan Zhang Weijia Jia

http://arxiv.org/abs/2604.16706v1 Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench 2026-04-17T21:15:35Z

Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100-label human-validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring-based judging agrees with human annotation at kappa=0.049 (chance-level); a three-LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter-level injection propagates to a wrong final answer with human-calibrated probability approximately 0.62 (range 0.46-0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman rho=0.126, p=0.747). A tuned runtime interceptor reduces hallucination on GPT-4o-mini by 23.0 percentage points under a concurrent n=600 control, but shows no significant effect on Gemini-2.0-Flash, whose aggressive parameter rejection eliminates the target failure mode. All code, data, traces, and human labels are released at https://github.com/bhaskargurram-ai/agenthallu-bench.
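For readers unfamiliar with the agreement statistic, the sketch below computes a kappa coefficient between two label sequences. It assumes the two-rater Cohen's kappa, (p_o - p_e) / (1 - p_e); the abstract does not say which kappa variant the authors used, and the labels in the example are invented, not taken from the benchmark.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    judge = [1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical automated-judge labels
    human = [1, 1, 0, 0, 0, 1, 1, 0]   # hypothetical human labels
    print(cohen_kappa(judge, human))
```

A value near 0 means agreement is no better than chance given the marginals, which is how the abstract characterizes substring-based judging.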

2026-04-17T21:15:35Z 9 pages, 5 figures, 12 tables (8 main + 4 supplementary). Under review at Information Processing & Management. Code and data: https://github.com/bhaskargurram-ai/agenthallu-bench Bhaskar Gurram

http://arxiv.org/abs/2604.22820v1 Complete Cyclic Subtask Graphs for Tool-Using LLM Agents: Flexibility, Cost, and Bottlenecks in Multi-Agent Workflows 2026-04-17T15:31:20Z

Long-horizon tool-using tasks sometimes benefit from revisiting earlier subtasks for recovery and exploration, but added multi-agent workflow flexibility can also introduce coordination overhead and substantial inference cost. We study complete cyclic subtask graphs, a deliberately maximally flexible multi-agent architecture in which executable subtask nodes are fully connected and a unified state-analysis-and-routing agent selects transitions using natural-language criteria. This makes unrestricted revisitation explicit and directly analyzable at the subtask level. We evaluate task-specific (Spec-Cyc) and benchmark-generic (Gen-Cyc) graphs on TextCraft, ALFWorld, and Finance-Agent, with ablations over planner/executor/router strength, tool exposure (generalist vs specialized), $n$-shot successful trajectory summaries, and fault-injected random subtask perturbations. The benchmarks expose three distinct regimes. ALFWorld highlights a setting where explicit revisitation supports recovery and exploration; TextCraft, a largely prerequisite-chain domain, often favors the efficiency of simpler forward execution; and Finance-Agent remains bottlenecked by retrieval, grounding, and evidence synthesis more than by workflow flexibility alone. Shared-win token comparisons further show that the added flexibility can be substantially more expensive than a single ReAct agent. Overall, we use complete cyclic subtask graphs as a maximally flexible experimental lens for measuring when multi-agent revisitation helps, when it mainly adds coordination cost, and when external task bottlenecks dominate.

2026-04-17T15:31:20Z Luay Gharzeddine Samer Saab

http://arxiv.org/abs/2604.16081v1 Veritas-RPM: Provenance-Guided Multi-Agent False Positive Suppression for Remote Patient Monitoring 2026-04-17T14:07:19Z

We present Veritas-RPM, a provenance-guided multi-agent architecture comprising five processing layers: VeritasAgent (ground-truth assembly), SentinelLayer (anomaly detection), DirectorAgent (specialist routing), six domain Specialist Agents, and MetaSentinelAgent (conflict resolution and final decision). We construct a 98-case synthetic taxonomy of false-positive scenarios derived from documented RPM patterns. Synthetic patient epochs (n = 530) were generated directly from taxonomy parameters and processed through the pipeline. Ground-truth labels are known for all cases. Performance is reported as True Suppression Rate (TSR), False Escalation Rate (FER), and Indeterminate Rate (INDR).

2026-04-17T14:07:19Z Aswini Misro Vikash Sharma Shreyank N Gowda

http://arxiv.org/abs/2604.16024v1 AstroVLM: Expert Multi-agent Collaborative Reasoning for Astronomical Imaging Quality Diagnosis 2026-04-17T12:54:26Z

Vision Language Models (VLMs) have been applied to several specific domains and have shown strong problem-solving capabilities. However, astronomical imaging, a complex problem involving multidisciplinary knowledge and several subtasks, has not been adequately studied. Both world-class astronomical organizations, such as NASA, and expert enthusiasts devote a great deal of time and effort to the astronomical imaging process. The processes involved have complex underlying correlations that significantly influence one another, making the quality diagnosis and error localization of astronomical images challenging. To address this problem, we propose AstroVLM, a collaborative multi-agent system for diagnosing the quality of astronomical images. Experimental results show that AstroVLM outperforms all baselines on real-world astronomical imaging quality diagnosis tasks, providing a reference for language models to handle complicated multi-process tasks.

2026-04-17T12:54:26Z Yaohui Han Tianshuo Wang Zixi Zhao Zhengchun Zhu Shuo Ren Yiru Wang Rongliang Fu Tinghuan Chen Tsung-Yi Ho

http://arxiv.org/abs/2604.16022v1 SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems 2026-04-17T12:51:46Z

As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.
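As background on Elo-based leaderboards, the sketch below applies the standard Elo update after a single two-player game. The K-factor of 32, the 400-point scale, and the starting rating of 1500 are conventional defaults assumed here; the abstract does not state SocialGrid's actual rating parameters or how multi-agent league games are scored.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a value from the paper.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


if __name__ == "__main__":
    print(elo_update(1500.0, 1500.0, 1.0))
```

Two equally rated agents each have an expected score of 0.5, so a win moves the winner up by k/2 and the loser down by the same amount, keeping the total rating constant.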

2026-04-17T12:51:46Z Preprint Hikaru Shindo Hanzhao Lin Lukas Helff Patrick Schramowski Kristian Kersting

http://arxiv.org/abs/2507.14995v4 LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading 2026-04-17T12:30:41Z

Real-time peer-to-peer (P2P) electricity markets dynamically adapt to fluctuations in renewable energy and variations in demand, maximizing economic benefits through instantaneous price responses while enhancing grid flexibility. However, scaling expert guidance to a massive number of personalized prosumers poses critical challenges, including diverse decision-making demands and a lack of customized modeling frameworks. This paper proposes an integrated large language model and multi-agent reinforcement learning (LLM-MARL) framework for real-time P2P energy trading to address challenges such as the limited technical capability of prosumers, the lack of expert experience, and security issues of distribution networks. LLMs are introduced as experts to generate personalized strategies, guiding MARL under the centralized training with decentralized execution (CTDE) paradigm through imitation. To handle the scalability issues inherent in large-scale P2P networks, a differential attention-based critic network is introduced to efficiently extract key interaction features and enhance convergence. Experimental results demonstrate that LLM-generated strategies effectively substitute for human experts. The proposed imitative expert MARL algorithms achieve significantly lower economic costs and voltage violation rates on test sets compared to baseline algorithms, while maintaining robust stability. This paper provides an effective solution for the real-time decision-making of the P2P electricity market by bridging expert knowledge with agent learning.

2025-07-20T14:59:18Z IEEE Transactions on Smart Grid (Early Access), 2026 Chengwei Lou Zekai Jin Wei Tang Guangfei Geng Jin Yang Lu Zhang 10.1109/TSG.2026.3684885

http://arxiv.org/abs/2605.16300v1 Consent Chain Degradation in Embodied Multi-Agent Systems: Bridging the Gap Between AI Agent Governance and Robot Ethics 2026-04-17T11:57:40Z

Robotic systems are moving from isolated platforms to interconnected multi-agent ecosystems that operate in human environments. This shift raises a governance problem that existing frameworks do not address: how does consent propagate, degrade, and break down across chains of delegation between embodied autonomous agents? The AI ethics community has begun to study consent for digital software agents, and the HRI community has examined consent in dyadic human-robot encounters. Neither body of work covers what happens when physical robots delegate tasks to other robots in ways that affect humans. This paper introduces consent chain degradation (CCD), a conceptual framework for analyzing how the specificity, validity, and scope of human consent erode as authority passes through multi-robot delegation chains. We propose a three-layer governance architecture, the Consent Runtime Verification Framework for Embodied Agents (CoRVE), which integrates consent scope modeling, delegation chain tracking, and physical irreversibility assessment. Three scenarios in healthcare, domestic, and industrial robotics show how CCD arises in practice, including a worked numerical example. A regulatory gap analysis covering the EU AI Act, the GDPR, the Machinery Regulation, and the Revised Product Liability Directive shows that all four instruments leave core CCD dimensions unaddressed.

2026-04-17T11:57:40Z Accepted for oral presentation at the 2nd Workshop on Robot Ethics (WoRoBet), ICRA 2026, Vienna, Austria, June 1, 2026. 6 pages, 3 tables, 1 figure Mehmet Haklidir