http://arxiv.org/abs/2606.09800v1
FASE: Fast Adaptive Semantic Entropy for Code Quality
2026-06-08T17:53:05Z
Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy by LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows.
2026-06-08T17:53:05Z
Shizhe Lin, Ladan Tahvildari

http://arxiv.org/abs/2505.01804v2
Pathfinders in the Sky: Formal Decision-Making Models for Collaborative Air Traffic Control in Convective Weather
2026-06-08T15:25:00Z
Air traffic can be significantly disrupted by weather. Pathfinder operations involve assigning a designated aircraft to assess whether airspace that was previously impacted by weather can be safely traversed.
Despite relatively routine use in air traffic control, there is little research on the underlying multi-agent decision-making problem. We seek to address this gap herein by formulating decision models to capture the operational dynamics and implications of pathfinders. Specifically, we construct a Markov chain to represent the stochastic transitions between key operational states (e.g., pathfinder selection). We then analyze its steady-state behavior to understand long-term system dynamics. We also propose models to characterize flight-specific acceptance behaviors (based on utility trade-offs) and pathfinder selection strategies (based on sequential offer allocations). We then conduct a worst-case scenario analysis that highlights risks from collective rejection and explores how selfless behavior and uncertainty affect system resilience. Empirical analysis of data from the US Federal Aviation Administration demonstrates the real-world significance of pathfinder operations and informs future model calibration.
2025-05-03T12:20:24Z
2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC)
Jimin Choi, Kartikeya Anand, Husni R. Idris, Huy T. Tran, Max Z. Li
10.1109/ITSC60802.2025.11423180

http://arxiv.org/abs/2606.09282v1
Revisiting mesoscopic traffic flow simulation in SUMO: Limitations, analysis, and an alternative
2026-06-08T09:51:09Z
Mesoscopic traffic flow models combine the merits of both macroscopic and microscopic models by capturing individual vehicle behavior in great detail while retaining computational efficiency. At the time of this study, the mesoscopic model proposed by Eissfeldt (2004) is used in Simulation of Urban MObility (SUMO). The movement of vehicles is governed by dynamic headways between edges. However, the model does not fully comply with the principle of the Lighthill-Whitham-Richards (LWR) model.
Several problems are identified, including the incomplete consideration of queue dynamics and the limited implementation of backward traveling spaces. Two case study scenarios demonstrate that the problems lead to unrealistic onset and recovery patterns of congestion. The magnitude of congestion is generally underestimated with this model. To address these drawbacks, a proper mesoscopic discrete-time implementation of the link transmission model, which follows the LWR principle, is proposed. By explicitly incorporating backward traveling spaces to capture queue spillback phenomena, the proposed model provides a more precise representation of congestion dynamics. The link density outputs are consistent with the kinematic wave theory and the microscopic traffic simulation in SUMO, thus verifying its theoretical accuracy.
2026-06-08T09:51:09Z
Presentation at SUMO Conference 2026
Ying-Chuan Ni, Alina Akopian, Anastasios Kouvelas, Michail A. Makridis

http://arxiv.org/abs/2603.18388v2
Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization
2026-06-08T08:18:54Z
Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and an interpretable optimization trace. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further escapes local optima.
VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.
2026-03-19T01:14:36Z
Accepted at ACL SRW 2026
Shiyan Liu, Qifeng Xia, Qiyun Xia, Yisheng Liu, Xinyu Yu, Rui Qu

http://arxiv.org/abs/2606.09176v1
Performance Evaluation of Social Learning
2026-06-08T08:14:56Z
Social Learning is a decentralized decision-making paradigm in which spatially dispersed agents collect streaming observations regulated by one of a finite number of models (the hypotheses). The agents are interested in assigning probability scores (the beliefs) to the possible hypotheses. To this end, the agents exchange their beliefs according to a certain communication graph. It has been shown that, under reasonable conditions on the identifiability of the decision model and the network connectivity, each agent ultimately places all the belief mass on the true hypothesis governing the data. However, several questions remain unanswered regarding the evaluation of the social learning performance. One recently adopted performance metric is the rejection rate, i.e., the rate at which the beliefs about the erroneous hypotheses vanish. One contribution of this work is to establish that the rejection rate leads to several paradoxes, which make it unsuitable as a valid performance measure. We then focus on studying the error probability measure. For a binary Gaussian problem, we derive an analytical formula characterizing the ratio between the individual agents' probabilities and the optimal Bayesian probability. The formula shows that this ratio is expressed by the product of two terms quantifying the effect of the network connectivity and the role of the prior information.
As a result, an irreducible gap emerges between the decentralized and the centralized error probabilities, which is agent-dependent and does not disappear asymptotically.
2026-06-08T08:14:56Z
This work has been submitted to the IEEE for possible publication
Felice Scala, Marco Carpentiero, Vincenzo Matta, Ali H. Sayed

http://arxiv.org/abs/2606.09122v1
Autonomous Incident Resolution at Hyperscale: An Agentic AI Architecture for Network Operations
2026-06-08T07:15:53Z
Cloud network infrastructure at hyperscale presents unique operational challenges where traditional human-driven incident response cannot keep pace with the volume, velocity, and complexity of failures. This paper presents an agentic AI architecture for autonomous incident resolution in large-scale network operations. Our system employs a multi-agent orchestration framework where specialized AI agents collaborate to detect, diagnose, and remediate network incidents without human intervention. We describe the architectural principles, including hierarchical agent decomposition, skills-based tool invocation via standardized protocols, structured knowledge encoding from operational runbooks, progressive autonomy with safety boundaries, and closed-loop verification. The architecture has been deployed in production at a major cloud provider, demonstrating that agentic AI systems can achieve autonomous resolution rates exceeding 90% for common incident categories while maintaining safety guarantees through layered authorization and rollback mechanisms.
We discuss design tradeoffs, failure modes, and lessons learned from operating autonomous AI agents at scale.
2026-06-08T07:15:53Z
7 pages, 6 figures
Arun Malik

http://arxiv.org/abs/2606.09037v1
A Multi-Agent System for IPMSM Design Optimization via an FEA-AI Hybrid Approach
2026-06-08T05:07:46Z
Interior permanent magnet synchronous motor (IPMSM) design requires balancing conflicting objectives and multi-physics constraints, while modern optimization workflows face three bottlenecks: manual problem setup, high finite element analysis (FEA) cost, and unreliable surrogate-based search in sparse or out-of-distribution regions. To address these limitations, we propose an end-to-end automated IPMSM design optimization framework that integrates retrieval-augmented generation (RAG) for structured problem definition with an uncertainty-aware FEA-AI hybrid optimization pipeline. A Design agent, connected to a motor textbook through RAG, provides domain-knowledge-based options and engineering tips, and compiles an optimization card and a design-of-experiments plan for AI-model training. A Training agent automates electromagnetic FEA, records geometry-validation and solver-failure logs, analyzes failed geometries using ANOVA-based data analysis and LLM reasoning, and invokes a Design Sampling agent to redefine the design space and generate additional samples. An Optimization agent performs GA-based search with uncertainty-driven switching: low-uncertainty candidates are evaluated by AI-surrogate inference, whereas high-uncertainty and reliability-critical Pareto-front or top-K candidates are corrected by high-fidelity FEA and reused for iterative retraining. The framework converts manual, experience-dependent configuration into a reproducible workflow that balances computational cost and prediction reliability.
Experimental results under a matched high-fidelity FEA budget show that the proposed hybrid approach achieves better objective performance while maintaining low and further reducible predictive uncertainty, outperforming FEA-only search, which is limited by early budget exhaustion, and AI-only search, which converges to a low-confidence optimum.
2026-06-08T05:07:46Z
26 pages, 21 figures
Jinseong Han, Sunwoong Yang, Namwoo Kang

http://arxiv.org/abs/2606.08960v1
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
2026-06-08T03:00:56Z
Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive.
We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers.
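The alternation of hacker, fixer, and solver described above can be sketched as a small control loop. This is only an illustrative toy under stated assumptions, not the authors' implementation: the three agents are plain Python callables rather than LLMs, and every name here (`hacker_fixer_loop`, `toy_hacker`, `toy_fixer`, `TRICKS`, `LEGIT`) is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Verifier = Callable[[str], bool]  # returns True when a submission is accepted


def hacker_fixer_loop(
    verifier: Verifier,
    hacker: Callable[[Verifier], Optional[str]],
    fixer: Callable[[Verifier, str], Verifier],
    legit_solution: str,
    max_rounds: int = 10,
) -> Tuple[Verifier, List[str]]:
    """Alternate hacker, fixer, and solver check until no exploit is found."""
    exploits: List[str] = []
    for _ in range(max_rounds):
        exploit = hacker(verifier)  # try to pass without solving the task
        if exploit is None:
            break  # the hacker found nothing this round
        candidate = fixer(verifier, exploit)  # patch that should reject it
        # Solver step: accept the patch only if a legitimate solution still
        # passes and the discovered exploit is now rejected.
        if candidate(legit_solution) and not candidate(exploit):
            verifier = candidate
            exploits.append(exploit)
        else:
            break  # a fuller system might ask the fixer to retry instead
    return verifier, exploits


# --- Toy stand-ins for the three agents (assumptions for illustration) ---
def toy_verifier(submission: str) -> bool:
    return "PASS" in submission  # brittle: only looks for a marker string


TRICKS = ["PASS", "echo PASS", "# PASS"]  # hypothetical exploit attempts
LEGIT = "answer=42; PASS"                 # hypothetical legitimate solution


def toy_hacker(verifier: Verifier) -> Optional[str]:
    return next((t for t in TRICKS if verifier(t)), None)


def toy_fixer(verifier: Verifier, exploit: str) -> Verifier:
    return lambda submission: submission != exploit and verifier(submission)


hardened, found = hacker_fixer_loop(toy_verifier, toy_hacker, toy_fixer, LEGIT)
```

In this toy run each patch closes one exploit and exposes the next, which mirrors the iteration described in the abstract. Blacklisting exact strings is used only to keep the sketch short; it is not a realistic patching strategy.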
On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.
2026-06-08T03:00:56Z
Ziqian Zhong, Ivgeni Segal, Ivan Bercovich, Shashwat Saxena, Kexun Zhang, Aditi Raghunathan

http://arxiv.org/abs/2605.16309v2
ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning
2026-06-08T02:31:38Z
LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge--operator schemas, preconditions, and constraints--remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability.
Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs--strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72--100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.
2026-05-04T05:24:03Z
Code Implementation: https://github.com/sbhakim/anneal-agents
Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song

http://arxiv.org/abs/2606.08878v1
PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting
2026-06-07T23:26:12Z
Real-world LLM applications are moving beyond single-agent workflows toward orchestrated multi-agent systems, yet current models still struggle to determine what each sub-agent needs to know. To measure this, we introduce PerspectiveGap, a benchmark for evaluating LLMs' ability to compose orchestration prompts for multi-agent systems. PerspectiveGap contains 110 scenarios, each evaluated through two distractor-mixed task formats: role-fragment assignment and free-form prompt writing. These scenarios are organized into 10 topologies, which are distilled from the authors' real-world engineering practice and framed by the Prompt Economy principle: building loop-centered orchestrations that maximize utility with minimal role and engineering overhead. In experiments with 27 commercial models from 10 companies, GPT-5.5 substantially outperforms all competitors, whereas Opus 4.7 shows a notable weakness in orchestration prompting despite its strong coding performance.
Nevertheless, PerspectiveGap remains challenging: the evaluated models achieve an average combined pass rate of only 14.9% (GPT-5.5 62.0%) and an average overall leakage rate of 246.5% (a per-scenario information leak-event count, not a proportion; GPT-5.5 49.1%). These findings suggest that multi-agent orchestration prompting is a distinct and under-evaluated capability, and PerspectiveGap provides a foundation for measuring and improving it systematically.
2026-06-07T23:26:12Z
Youran Sun, Xingyu Ren, Kejia Zhang, Xinpeng Liu, Jiaxuan Guo

http://arxiv.org/abs/2510.15747v3
GLP: A Grassroots, Multiagent, Concurrent, Logic Programming Language for AI (Full Version)
2026-06-07T20:02:43Z
A grassroots platform is a multiagent distributed system in which multiple independent instances can form and operate independently of each other and of any global resource, yet may coalesce into ever larger instances, possibly resulting in a single global instance. Grassroots platforms aim to offer an egalitarian/democratic alternative to centralised/autocratic and decentralised/plutocratic global platforms.
Here, we present Grassroots Logic Programs (GLP), a multiagent concurrent logic programming language designed for the implementation of grassroots platforms: we recall the standard operational semantics of logic programs; introduce the concurrent operational semantics of GLP as its restriction; recall multiagent atomic transactions; use them to introduce a multiagent operational semantics of GLP; and prove multiagent GLP to be grassroots. The grassroots social graph -- the foundational grassroots platform on which all others are based -- serves as a GLP programming example.
These mathematical foundations are being used by AI to implement GLP as well as to program in GLP: a workstation-based implementation of concurrent GLP in Dart was derived from the concurrent operational semantics of GLP; a multiagent smartphone-based implementation of GLP in Dart/Flutter is being developed based on the multiagent operational semantics of GLP; a moded type system for GLP was designed (and implemented by AI in Dart) to facilitate collaborative human-AI development of GLP programs, where AI derives working GLP programs from human-approved type definitions and declarations; GLP implementations of grassroots platforms for the social graph, social networks, currencies and bonds, and more, have been derived by AI from mathematical specifications written as volitional multiagent atomic transactions.
2025-10-17T15:34:27Z
Ehud Shapiro

http://arxiv.org/abs/2606.08790v1
RAILS: Verification-Native Clearing For Agentic Commerce
2026-06-07T19:12:55Z
Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whether they
met their delegated obligation, who is responsible when they did not, or which settlement action follows. This is the
agentic clearing problem. Tool protocols (MCP), inter-agent communication (A2A), payment rails (x402), mandate and
network agent protocols (AP2, Visa, Mastercard), and settlement-risk standards each assume that determination and none
produce it.
Clearing is the missing primitive. Payment is not clearing. Authorization is not clearing. LLM-as-judge evaluation is
not clearing. Settlement-risk escrow is not clearing: it consumes clearing decisions.
RAILS (Real-Time Agent Integrity & Ledger Settlement) is the integrity and clearing layer for agentic commerce,
spanning a per-output reliability score, a published reliability record, and a clearing function that consumes them.
The clearing protocol at its core closes that gap. Seven primitives (Obligation Object, Evidence Envelope,
Verification Mesh, Clearing Decision, Settlement Instruction, Clearing Passport, Finality Rules), bound by a formal
model of admissibility-graded verification, together yield a soundness property: no financially material settlement is
supported by evidence below the obligation's admissibility floor. The property is falsifiable against the spec. We
are not aware of a prior agent-commerce verification mechanism that states a property of this kind. The approaches
nearest to it emit a pass, a delivery guarantee, a bare score, or an equilibrium.
This paper specifies that clearing protocol.
2026-06-07T19:12:55Z
49 pages, 15 figures
Adrian de Valois-Franklin, Alex Bogdan

http://arxiv.org/abs/2602.06934v4
Implementing Grassroots Logic Programs with Multiagent Transition Systems and AI (Full Version)
2026-06-07T18:21:02Z
Grassroots Logic Programs (GLP) is a concurrent logic programming language in which logic variables are partitioned into paired readers and writers. An assignment is produced at most once via a writer and consumed at most once via its paired reader, and may contain additional readers and/or writers. This enables the concise expression of rich multidirectional communication modalities.
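The reader/writer discipline stated above (one assignment per writer, one consumption per paired reader, and values that may themselves carry further readers or writers) can be illustrated with a small Python analogy. This is a sketch under loud assumptions: it is not GLP and not the paper's Dart implementation, all class and function names are hypothetical, and the toy raises an error where a real concurrent language would likely handle an unassigned variable differently.

```python
class _Cell:
    """Shared state behind one reader/writer pair (hypothetical helper)."""

    def __init__(self):
        self.assigned = False
        self.consumed = False
        self.value = None


class Writer:
    """Produces the assignment; may be used at most once."""

    def __init__(self, cell):
        self._cell = cell

    def assign(self, value):
        if self._cell.assigned:
            raise RuntimeError("writer already used")
        self._cell.value = value
        self._cell.assigned = True


class Reader:
    """Consumes the assignment; may be used at most once."""

    def __init__(self, cell):
        self._cell = cell

    def consume(self):
        if not self._cell.assigned:
            # Simplification of this toy, not a claim about GLP semantics.
            raise RuntimeError("nothing assigned yet")
        if self._cell.consumed:
            raise RuntimeError("reader already used")
        self._cell.consumed = True
        return self._cell.value


def new_pair():
    """Create a paired (reader, writer) sharing one cell."""
    cell = _Cell()
    return Reader(cell), Writer(cell)


# A request that carries a writer for the reply: two-way communication
# built from two single-use pairs.
request_reader, request_writer = new_pair()
reply_reader, reply_writer = new_pair()

request_writer.assign(("ping", reply_writer))      # client sends request
message, reply_to = request_reader.consume()       # server receives it
reply_to.assign(message.replace("ping", "pong"))   # server answers
reply = reply_reader.consume()                     # client reads "pong"
```

Embedding `reply_writer` inside the request value is what lets the receiving side answer, which gives a rough feel for how assignments containing readers and writers can express multidirectional communication.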
The language was introduced together with concurrent (cGLP) and multiagent (maGLP) operational semantics. Here, we derive from these (a) dGLP, a deterministic counterpart of cGLP, and (b) madGLP, a counterpart of maGLP in which deterministic agents communicate solely by asynchronous message passing, and prove them correct against their abstract counterparts. maGLP shared variable pairs spanning agents can be implemented as local variables paired by global links, with correctness following from disjoint substitution commutativity (a consequence of GLP's single-occurrence invariant). We further prove that madGLP is grassroots. Both dGLP and madGLP serve as formal specifications for an AI-driven implementation discipline (math → informal spec → Dart) employed and described here: from dGLP, AI (Claude) developed a workstation-based GLP implementation in Dart, and from madGLP it is developing a smartphone-based multiagent one.
2026-02-06T18:30:11Z
Ehud Shapiro

http://arxiv.org/abs/2512.20845v2
MAR:Multi-Agent Reflexion Improves Reasoning Abilities in LLMs
2026-06-07T16:01:34Z
LLMs have shown the capacity to improve their performance on reasoning tasks through reflecting on their mistakes, and acting with these reflections in mind. However, continual reflections of the same LLM onto itself exhibit degeneration of thought, where the LLM continues to repeat the same errors again and again even with the knowledge that it's wrong. To address this problem, we instead introduce a multi-agent approach with multi-persona debaters as the method to generate reflections. Through extensive experimentation, we found that this leads to greater diversity in the reflections generated by the LLM agent.
We demonstrate an accuracy of 47% EM on HotPot QA (question answering) and 82.7% on HumanEval (programming), both surpassing reflection with a single LLM.
2025-12-23T23:47:31Z
Onat Ozer, Yuchen Wang, Grace Wu, Daniel Dosti, Honghao Zhang, Vivi De La Rue

http://arxiv.org/abs/2606.08701v1
Is Telehealth Better Used to Treat Patients or Help Other Physicians Treat Patients? An Agent-Based Modeling Study of Healthcare Provision
2026-06-07T15:56:48Z
Telehealth, the delivery of medical care remotely, is hoped to increase access to specialty services or decrease health care utilization. Physicians can provide telehealth to each other or to patients. Specialists often treat complex patients who can be adequately cared for only in academic hospitals, suggesting that providing specialty services via telehealth will reallocate rather than reduce system utilization. Here I use agent-based modeling to investigate telehealth's effects on clinical outcomes and system utilization in medical toxicology. I found that physician-physician telehealth increased patient health but system utilization did not change. The effects were more pronounced as clinical complexity increased. Physician-patient telehealth increased cost and system utilization but not clinical outcomes. Within the limitations of our approach, these results suggest that telehealth is more cost-effective for improving generalist access to specialist knowledge than in providing care to the public.
2026-06-07T15:56:48Z
Presented at HICSS 2022
Michael Chary