https://arxiv.org/api/YzIcppQ/yyt2IUduSJqy+8+iw/0 2026-06-13T14:08:16Z 12619 60 15 http://arxiv.org/abs/2606.09122v1 Autonomous Incident Resolution at Hyperscale: An Agentic AI Architecture for Network Operations 2026-06-08T07:15:53Z

Cloud network infrastructure at hyperscale presents unique operational challenges where traditional human-driven incident response cannot keep pace with the volume, velocity, and complexity of failures. This paper presents an agentic AI architecture for autonomous incident resolution in large-scale network operations. Our system employs a multi-agent orchestration framework where specialized AI agents collaborate to detect, diagnose, and remediate network incidents without human intervention. We describe the architectural principles, including hierarchical agent decomposition, skills-based tool invocation via standardized protocols, structured knowledge encoding from operational runbooks, progressive autonomy with safety boundaries, and closed-loop verification. The architecture has been deployed in production at a major cloud provider, demonstrating that agentic AI systems can achieve autonomous resolution rates exceeding 90% for common incident categories while maintaining safety guarantees through layered authorization and rollback mechanisms. We discuss design tradeoffs, failure modes, and lessons learned from operating autonomous AI agents at scale.

2026-06-08T07:15:53Z 7 pages, 6 figures Arun Malik http://arxiv.org/abs/2606.09037v1 A Multi-Agent System for IPMSM Design Optimization via an FEA-AI Hybrid Approach 2026-06-08T05:07:46Z

Interior permanent magnet synchronous motor (IPMSM) design requires balancing conflicting objectives and multi-physics constraints, while modern optimization workflows face three bottlenecks: manual problem setup, high finite element analysis (FEA) cost, and unreliable surrogate-based search in sparse or out-of-distribution regions. To address these limitations, we propose an end-to-end automated IPMSM design optimization framework that integrates retrieval-augmented generation (RAG) for structured problem definition with an uncertainty-aware FEA-AI hybrid optimization pipeline. A Design agent, connected to a motor textbook through RAG, provides domain-knowledge-based options and engineering tips, and compiles an optimization card and a design-of-experiments plan for AI-model training. A Training agent automates electromagnetic FEA, records geometry-validation and solver-failure logs, analyzes failed geometries using ANOVA-based data analysis and LLM reasoning, and invokes a Design Sampling agent to redefine the design space and generate additional samples. An Optimization agent performs GA-based search with uncertainty-driven switching: low-uncertainty candidates are evaluated by AI-surrogate inference, whereas high-uncertainty and reliability-critical Pareto-front or top-K candidates are corrected by high-fidelity FEA and reused for iterative retraining. The framework converts manual, experience-dependent configuration into a reproducible workflow that balances computational cost and prediction reliability. Experimental results under a matched high-fidelity FEA budget show that the proposed hybrid approach achieves better objective performance while maintaining low and further reducible predictive uncertainty, outperforming FEA-only search, which is limited by early budget exhaustion, and AI-only search, which converges to a low-confidence optimum.

2026-06-08T05:07:46Z 26 pages, 21 figures Jinseong Han Sunwoong Yang Namwoo Kang http://arxiv.org/abs/2606.08960v1 Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops 2026-06-08T03:00:56Z

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.

2026-06-08T03:00:56Z Ziqian Zhong Ivgeni Segal Ivan Bercovich Shashwat Saxena Kexun Zhang Aditi Raghunathan http://arxiv.org/abs/2605.16309v2 ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning 2026-06-08T02:31:38Z

LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge--operator schemas, preconditions, and constraints--remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs--strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72--100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.

2026-05-04T05:24:03Z Code Implementation: https://github.com/sbhakim/anneal-agents Safayat Bin Hakim Keyan Guo Wenkai Tan Alvaro Velasquez Shouhuai Xu Houbing Herbert Song http://arxiv.org/abs/2606.08878v1 PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting 2026-06-07T23:26:12Z

Real-world LLM applications are moving beyond single-agent workflows toward orchestrated multi-agent systems, yet current models still struggle to determine what each sub-agent needs to know. To measure this, we introduce PerspectiveGap, a benchmark for evaluating LLMs' ability to compose orchestration prompts for multi-agent systems. PerspectiveGap contains 110 scenarios, each evaluated through two distractor-mixed task formats: role-fragment assignment and free-form prompt writing. These scenarios are organized into 10 topologies, which are distilled from the authors' real-world engineering practice and framed by the Prompt Economy principle: building loop-centered orchestrations that maximize utility with minimal role and engineering overhead. In experiments with 27 commercial models from 10 companies, GPT-5.5 substantially outperforms all competitors, whereas Opus 4.7 shows a notable weakness in orchestration prompting despite its strong coding performance. Nevertheless, PerspectiveGap remains challenging: the evaluated models achieve an average combined pass rate of only 14.9\% (GPT-5.5 62.0\%) and an average overall leakage rate of 246.5\% (a per-scenario information leak-event count, not a proportion; GPT-5.5 49.1\%). These findings suggest that multi-agent orchestration prompting is a distinct and under-evaluated capability, and PerspectiveGap provides a foundation for measuring and improving it systematically.

2026-06-07T23:26:12Z Youran Sun Xingyu Ren Kejia Zhang Xinpeng Liu Jiaxuan Guo http://arxiv.org/abs/2510.15747v3 GLP: A Grassroots, Multiagent, Concurrent, Logic Programming Language for AI (Full Version) 2026-06-07T20:02:43Z

A grassroots platform is a multiagent distributed system in which multiple independent instances can form and operate independently of each other and of any global resource, yet may coalesce into ever larger instances, possibly resulting in a single global instance. Grassroots platforms aim to offer an egalitarian/democratic alternative to centralised/autocratic and decentralised/plutocratic global platforms. Here, we present Grassroots Logic Programs (GLP), a multiagent concurrent logic programming language designed for the implementation of grassroots platforms: we recall the standard operational semantics of logic programs; introduce the concurrent operational semantics of GLP as its restriction; recall multiagent atomic transactions; use them to introduce a multiagent operational semantics of GLP; and prove multiagent GLP to be grassroots. The grassroots social graph -- the foundational grassroots platform on which all others are based -- serves as a GLP programming example. These mathematical foundations are being used by AI to implement GLP as well as to program in GLP: a workstation-based implementation of concurrent GLP in Dart was derived from the concurrent operational semantics of GLP; a multiagent smartphone-based implementation of GLP in Dart/Flutter is being developed based on the multiagent operational semantics of GLP; a moded type system for GLP was designed (and implemented by AI in Dart) to facilitate collaborative human-AI development of GLP programs, where AI derives working GLP programs from human-approved type definitions and declarations; GLP implementations of grassroots platforms for the social graph, social networks, currencies and bonds, and more, have been derived by AI from mathematical specifications written as volitional multiagent atomic transactions.

2025-10-17T15:34:27Z Ehud Shapiro http://arxiv.org/abs/2606.08790v1 RAILS: Verification-Native Clearing For Agentic Commerce 2026-06-07T19:12:55Z

Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whether they met their delegated obligation, who is responsible when they did not, or which settlement action follows. This is the agentic clearing problem. Tool protocols (MCP), inter-agent communication (A2A), payment rails (x402), mandate and network agent protocols (AP2, Visa, Mastercard), and settlement-risk standards each assume that determination and none produce it. Clearing is the missing primitive. Payment is not clearing. Authorization is not clearing. LLM-as-judge evaluation is not clearing. Settlement-risk escrow is not clearing: it consumes clearing decisions. RAILS (Real-Time Agent Integrity & Ledger Settlement) is the integrity and clearing layer for agentic commerce, spanning a per-output reliability score, a published reliability record, and a clearing function that consumes them. The clearing protocol at its core closes that gap. Seven primitives (Obligation Object, Evidence Envelope, Verification Mesh, Clearing Decision, Settlement Instruction, Clearing Passport, Finality Rules), bound by a formal model of admissibility-graded verification, together yield a soundness property: no financially material settlement is supported by evidence below the obligation's admissibility floor. The property is falsifiable against the spec. We are not aware of a prior agent-commerce verification mechanism that states a property of this kind. The approaches nearest to it emit a pass, a delivery guarantee, a bare score, or an equilibrium. This paper specifies that clearing protocol.

2026-06-07T19:12:55Z 49 pages, 15 figures Adrian de Valois-Franklin Alex Bogdan http://arxiv.org/abs/2602.06934v4 Implementing Grassroots Logic Programs with Multiagent Transition Systems and AI (Full Version) 2026-06-07T18:21:02Z

Grassroots Logic Programs (GLP) is a concurrent logic programming language in which logic variables are partitioned into paired readers and writers. An assignment is produced at most once via a writer and consumed at most once via its paired reader, and may contain additional readers and/or writers. This enables the concise expression of rich multidirectional communication modalities. The language was introduced together with concurrent (cGLP) and multiagent (maGLP) operational semantics. Here, we derive from these (\ia)~dGLP, a deterministic counterpart of cGLP, and (\ib)~madGLP, a counterpart of maGLP in which deterministic agents communicate solely by asynchronous message passing, and prove them correct against their abstract counterparts. maGLP shared variable pairs spanning agents can be implemented as local variables paired by \emph{global links}, with correctness following from disjoint substitution commutativity (a consequence of GLP's single-occurrence invariant). We further prove that madGLP is grassroots. Both dGLP and madGLP serve as formal specifications for an AI-driven implementation discipline (math $\to$ informal spec $\to$ Dart) employed and described here: from dGLP, AI (Claude) developed a workstation-based GLP implementation in Dart, and from madGLP it is developing a smartphone-based multiagent one.

2026-02-06T18:30:11Z Ehud Shapiro http://arxiv.org/abs/2512.20845v2 MAR:Multi-Agent Reflexion Improves Reasoning Abilities in LLMs 2026-06-07T16:01:34Z

LLMs have shown the capacity to improve their performance on reasoning tasks through reflecting on their mistakes, and acting with these reflections in mind. However, continual reflections of the same LLM onto itself exhibit degeneration of thought, where the LLM continues to repeat the same errors again and again even with the knowledge that its wrong. To address this problem, we instead introduce multi-agent with multi-persona debators as the method to generate reflections. Through out extensive experimentation, we've found that the leads to better diversity of in the reflections generated by the llm agent. We demonstrate an accuracy of 47% EM HotPot QA (question answering) and 82.7% on HumanEval (programming), both performances surpassing reflection with a single llm.

2025-12-23T23:47:31Z Onat Ozer Yuchen Wang Grace Wu Daniel Dosti Honghao Zhang Vivi De La Rue http://arxiv.org/abs/2606.08701v1 Is Telehealth Better Used to Treat Patients or Help Other Physicians Treat Patients? An Agent-Based Modeling Study of Healthcare Provision 2026-06-07T15:56:48Z

Telehealth, the delivery of medical care remotely, is hoped to increase access to specialty services or decrease health care utilization. Physicians can provide telehealth to each other or to patients. Specialists often treat complex patients who can be adequately cared for only in academic hospitals, suggesting that providing specialty services via telehealth will reallocate rather than reduce system utilization. Here I use agent-based modeling to investigate telehealth's effects on clinical outcomes and system utilization in medical toxicology. I found that physician-physician telehealth increased patient health but system utilization did not change. The effects were more pronounced as clinical complexity increased. Physician-patient telehealth increased cost and system utilization but not clinical outcomes. Within the limitations of our approach, these results suggest that telehealth is more cost-effective for improving generalist access to specialist knowledge than in providing care to the public.

2026-06-07T15:56:48Z Presented at HICSS 2022 Michael Chary http://arxiv.org/abs/2601.17957v4 Moded Types for Grassroots Logic Programs, by AI, for AI (Full Version) 2026-06-07T15:28:48Z

Grassroots Logic Programs (GLP) is a concurrent logic programming language in which logic variables are partitioned into paired readers and writers. An assignment is produced at most once via a writer and consumed at most once via its paired reader, and may contain additional readers and/or writers. This enables the concise expression of rich multidirectional communication modalities. ``Logic Programs as Types for Logic Programs'' (LICS'91) defined types as regular sets of paths over the Herbrand atom semantics of a logic program. Here, we develop a \emph{moded-atom semantics} that extends the standard Herbrand atom semantics in two ways: (\ia)~each atom subterm carries a \emph{mode}, recording whether it is consumed from or produced to the environment; and (\ib)~partial computations, including those that deadlock, fail, or never terminate, also contribute moded atoms to the semantics. We define types to be regular sets of \emph{moded paths} over this semantics, give a syntactic definition of GLP well-typing, and prove that a well-typed program is sound: every output path in its well-typed moded-atom semantics conforms to its declared output type. A type checker for GLP was implemented \emph{by} AI (Claude) in Dart, starting from the mathematical specification of Typed GLP (this paper), deriving from it an English+pseudocode spec (written by AI), and from the spec deriving Dart code (by AI). While GLP is naturally untyped, the motivation for typing it was \emph{for} AI: tasking AI to program complex communication modalities and hoping for the best turned out to be a tenuous strategy. The discipline we developed with Typed GLP is for the human designer and AI to jointly develop formal GLP type definitions and declarations, together with informal intent of the declared procedures, and only then let AI write the GLP code.

2026-01-25T19:18:27Z Ehud Shapiro http://arxiv.org/abs/2508.06336v2 Unsupervised Partner Design Enables Robust Ad-hoc Teamwork 2026-06-07T11:04:52Z

We introduce Unsupervised Partner Design (UPD), a population-free multi-agent reinforcement learning method for robust ad-hoc teamwork. UPD generates training partners on-the-fly and selects them adaptively based on a learnability criterion, removing the need for pre-trained partner populations or manual parameter tuning. We show that this simple mechanism enables effective partner diversity and can be extended to joint partner-environment selection when a procedural level generator is available. Across Level-Based Foraging, Overcooked-AI, and the Overcooked Generalisation Challenge, UPD consistently achieves strong performance compared to both population-based and population-free baselines. In a human-AI user study, agents trained with UPD achieve higher returns and are rated as more adaptive, more human-like, and less frustrating than all evaluated baseline methods.

2025-08-08T14:11:15Z 27 pages Constantin Ruhdorfer Matteo Bortoletto Victor Oei Anna Penzkofer Andreas Bulling http://arxiv.org/abs/2606.08552v1 Quantitative Promise Theory: Intentionality and Inference in Autonomous Agents 2026-06-07T10:12:44Z

I discuss some quantitative representations of Promise Theory for processes involving autonomous agents. Agent models are common in software systems, machine learning, and biology, for example, but may also apply to physics and other forms of engineering. I describe how Bayesian probability and information theoretic optimization, including Active Inference, may be incorporated with promise semantics -- as well as how Promise Theory supplements solutions, helping to avoid probability's pitfalls, which include non-local coordination, calibrating, and normalizing probabilistic computations. The role of boundary conditions in constraining allowed states and selecting decision thresholds is a form of promise, and agent alignment provides a scalable definition of intent. Autonomous agents may congeal into swarms with superagent characteristics by trying to minimize their information, despite uncertainty that works to maximize it. The use of Promise Theory involves some research challenges as well as stylistic preferences.

2026-06-07T10:12:44Z Mark Burgess http://arxiv.org/abs/2604.08849v2 SatIR: Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching 2026-06-07T05:55:06Z

Many important retrieval problems are not merely problems of semantic similarity, but problems of constraint satisfaction: a retrieved item should be topically relevant to a query and satisfy explicit requirements involving negation, temporal conditions, numeric thresholds, exceptions, ontological relations, and incomplete evidence. We study this challenge in clinical trial matching, a high-stakes test bed where a useful trial must both address a patient's medical needs and satisfy complex eligibility criteria. We propose SatIR, a scalable constraint-based retrieval method for clinical trial matching. SatIR converts trial eligibility criteria and summaries into formal constraints, then retrieves patient--trial pairs by executing these constraints over a database. The system combines Satisfiability Modulo Theories (SMT), relational algebra, medical ontology grounding, and large language models (LLMs): formal methods provide executable and inspectable matching, while LLMs convert ambiguous, incomplete, and implicit clinical information into explicit, controllable constraint representations. Across the SIGIR 2016 patient--trial collection and TREC-2022-RetrievalSubset, a benchmark derived from TREC 2022, SATIR consistently improves eligibility-aware retrieval over similarity-based baselines. Relative to TrialGPT-style retrieval, SATIR retrieves 32%--72% more relevant-and-eligible trials per patient on SIGIR 2016 and achieves $1.8$--$3.2\times$ higher eligible-trial recall on TREC-2022-RetrievalSubset. Retrieval is fast, requiring only 146 milliseconds per patient over 3,621 SIGIR trials.

2026-04-10T01:13:44Z Cyrus Zhou Yufei Jin Yilin Xu Yu-Chiang Wang Chieh-Ju Chao Monica S. Lam http://arxiv.org/abs/2606.08457v1 The Consistency Illusion: How Multi-Agent Debate Hides Reasoning Misalignment 2026-06-07T05:14:26Z

Multi-agent LLM systems for medical question answering often treat consensus as a reliability signal: if multiple agents agree on an answer, it is presumed trustworthy. However, answer-level consensus does not entail reasoning-level alignment. We introduce CARA (Cross-Agent Reasoning Alignment), a family of automated metrics that measure whether agents who agree on an answer also agree on the reasoning. Applying CARA to a standard debate system on two medical QA benchmarks, MedQA-USMLE and MedThink-Bench, we identify the consistency illusion: a failure mode where debate reduces detectable contradictions between agents while simultaneously decreasing the semantic similarity of their reasoning chains; agents appear to agree more but reason less consistently. To improve this misalignment, we propose the Grounded Debate Protocol (GDP), a prompt-level intervention that requires agents to commit to named medical facts and take explicit stances on other agents' claims. GDP produces large, consistent alignment improvements, with Cohen's d ranging from +1.43 to +1.99, across two datasets and two backbone models, without adding LLM calls or modifying system architecture. Our results motivate cross-agent reasoning alignment as a quantity to audit alongside accuracy in safety-critical domains.

2026-06-07T05:14:26Z Xiaoyang Wang Christopher C. Yang