http://arxiv.org/abs/2512.18470v6 SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios 2026-05-22T09:24:26Z

Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric capturing partial progress on these complex, long-horizon tasks.

2025-12-20T19:08:15Z Tue Le Minh V. T. Thai Dung Nguyen Manh Huy Phan Nhat Nghi D. Q. Bui http://arxiv.org/abs/2605.17076v2 S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination 2026-05-22T08:53:52Z

We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S-Bus is an HTTP middleware whose central mechanism, a server-side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides -- Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable read projection -- prevents Structural Race Conditions in dedicated-shard topologies. Three contributions. (C1) DeliveryLog mechanism with three-tier mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type-I corruptions across 884,110 commit attempts (427,308 under active contention). (C3) ORI is semantically neutral in dedicated-shard workloads but harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. v2 update: the PH-3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter-LLM-judge agreement is kappa=0.46 (boundary variance). Agent self-reports over-claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ-v4 semantic-quality rubric remains single-judge LLM-only. Source code, formal proofs, harness, annotation data: https://github.com/sajjadanwar0/sbus
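The read-set reconstruction idea can be illustrated with a toy in-memory model. This is a minimal sketch, not the S-Bus implementation: the class names `ToyBus` and `Shard`, the per-shard version counter, and the "reject if any observed version is stale" rule are all assumptions made for illustration, and the sketch does not claim to implement ORI as the paper defines it.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    value: str
    version: int = 0


@dataclass
class ToyBus:
    """Hypothetical stand-in for the middleware; names are illustrative."""
    shards: dict = field(default_factory=dict)
    # delivery_log[agent] maps shard key -> version last served to that agent.
    delivery_log: dict = field(default_factory=dict)

    def get(self, agent: str, key: str) -> str:
        """Serve a read and record which version the agent observed."""
        shard = self.shards[key]
        self.delivery_log.setdefault(agent, {})[key] = shard.version
        return shard.value

    def commit(self, agent: str, key: str, new_value: str) -> bool:
        """Accept the write only if every shard the agent read is unchanged."""
        read_set = self.delivery_log.get(agent, {})
        stale = [k for k, v in read_set.items() if self.shards[k].version != v]
        # The reconstructed read set is consumed whether or not the commit succeeds.
        self.delivery_log[agent] = {}
        if stale:
            return False  # the agent must re-read before retrying
        shard = self.shards.setdefault(key, Shard(value=""))
        shard.value = new_value
        shard.version += 1
        return True


bus = ToyBus(shards={"plan": Shard("v0"), "notes": Shard("n0")})
bus.get("a", "plan")
bus.get("b", "plan")
first = bus.commit("a", "plan", "a-edit")    # accepted: a's read is current
second = bus.commit("b", "notes", "b-note")  # rejected: b read a stale "plan"
bus.get("b", "plan")
third = bus.commit("b", "notes", "b-note")   # accepted after re-reading
```

The agents never declare what they read; the server infers it from the reads it served, which is the property the abstract emphasizes.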

2026-05-16T16:46:27Z v2: LLM judge validated against human annotator (Zahid Hussain, Mindgigs Peshawar) on PH-3 at strict kappa=0.93 (n=93, 96.8% agreement); over-claim refined to 32% (LLM) / 49% (human). Adds Exp.PG-Comparison Rust-Native and Workload-B chi2=1094.98. 24 pages, 23 tables. Annotation data attached as arXiv ancillary files Sajjad Khan http://arxiv.org/abs/2605.27433v1 Heterogeneous Multi-Agent Modeling for Measurement and Network Analysis of the Data Service Market 2026-05-22T08:44:56Z

As collaboration among social entities and user demands grow more complex, so do the factors affecting the stable development of the data service market. These factors include the widespread dissemination of information that strengthens subjective awareness, continuous improvements in intelligence, and increasingly complex structural relationships. Effective governance and regulation of the data service market therefore require simulation experiments before regulatory decisions are made. However, current research on the data service market focuses primarily on data-level performance and is inadequate for measuring and analyzing multiple heterogeneous entities and the integration of various social elements within the market. To address this gap, this paper proposes a measurement and network analysis method for the data service market based on heterogeneous multi-agent modeling. Drawing on service ecosystem theory, we clarify the participants and external factors of the data service market and measure utility for entities at three levels based on value creation. We further devise an analytical methodology to assess the influence of heterogeneous networks on utility. Finally, we verify the effectiveness of the proposed method through analysis of experimental results.

2026-05-22T08:44:56Z Deyu Zhou Yuwei Guo Xudong Lu Linhao Zhang Wei Guo Lizhen Cui http://arxiv.org/abs/2606.07552v1 Symbolic Reasoning Frameworks Modulate LLM Risk Aversion in Multi-Agent Strategic Settings 2026-05-22T08:24:14Z

Large language models exhibit innate behavioral tendencies when deployed as strategic agents -- notably a risk-averse "turtle" bias toward defensive play. We show that symbolic reasoning frameworks, injected as per-round reflective prompts into one agent, differentially modulate this bias and reshape the multi-agent ecosystem to produce framework-specific winner distributions. In a 7-player Warring States Diplomacy variant (41 games, 4 conditions, single-campaign memory accumulation), each framework produces a distinct ecosystem signature: under control, Yan dominates (7/11, 64%); under I-Ching yarrow divination, Yan and Chu co-dominate while Qin is completely suppressed (0/10); under Tarot, Qin dominates (5/10, Fisher vs. pooled p = 0.006); under scrambled-text ablation (incoherent oracle text preserving prompt structure), Qi dominates (5/10, Fisher vs. pooled p = 0.006). The framework-receiving agent (Han) never wins and shows no survival difference across conditions (Fisher p = 1.0), but Tarot consistently elevates Han's peak territory (mean 3.0 SCs vs. 2.1-2.5 others, Kruskal-Wallis p = 0.010). Neither framework's content predicts subsequent actions -- hexagram themes (chi-squared p = 0.95) and Tarot card postures (chi-squared p = 0.69) are both independent of action choice -- suggesting the modulation operates through the reflective process, not content-following. We present this as an observation paper establishing that alignment-framework choice at the agent level produces distinctive system-level consequences in multi-agent settings.

2026-05-22T08:24:14Z 17 pages, 3 figures, 6 tables, 6 listings. Code and data: https://doi.org/10.5281/zenodo.20338937 Augustin Chan http://arxiv.org/abs/2605.23321v1 Arrow-Type Impossibility for Genuinely Modal Judgments 2026-05-22T07:36:30Z

Judgment aggregation studies how to combine individual judgments on logically related propositions into a collective judgment. Classical impossibility results show that sufficiently strong logical interconnections force dictatorship under natural aggregation axioms. In this paper, we ask whether such impossibility can still arise when the objects of aggregation are required to be genuinely modal judgments rather than plain factual propositions. Since modal logic contains propositional logic, this question is meaningful only if one excludes fact-based aggregation in disguise. We show that Arrow-type impossibility already re-emerges in a strikingly sparse modal setting. We prove an impossibility theorem on a simple cyclic frame for an agenda generated from a single propositional variable by repeated applications of a single modal operator, and we further demonstrate this phenomenon for an alternative family of frames satisfying a natural symmetry condition. Thus, even under a modal-operator requirement, semantic structure alone can generate the logical interconnections needed for dictatorship. Technically, our analysis has two layers. First, we prove a semantic reduction theorem showing that certain iterated modal patterns can be collapsed by shifting the evaluation point. Second, building on this reduction, we identify a local-to-global frame mechanism by which frame geometry yields minimally inconsistent modal judgment sets and the strong path-connectivity required for impossibility. The same reduction also turns consistency checking into a small combinatorial covering problem, which yields efficient implementations of non-dictatorial aggregation procedures.

2026-05-22T07:36:30Z 24 pages Yutaka Nagai Hirotaka Ono http://arxiv.org/abs/2605.23273v1 Self-Refining Topology Optimization via an LLM-Based Multi-Agent Framework 2026-05-22T06:27:18Z

Topology optimization is a widely used design method that produces optimized material distributions for prescribed objectives and constraints through well-established numerical algorithms. Throughout the workflow, engineers make a series of decisions ranging from setting and adjusting numerical parameters to assessing whether the converged design meets considerations beyond those explicitly included in the optimization problem, such as physical feasibility. These decisions, which draw on domain expertise, interfere with the autonomous design process. To address this difficulty, this study presents TopOptAgents, a multi-agent system for automating not only the design process but also decision-making during the key stages of the topology optimization process. TopOptAgents consists of six LLM-based agents collaborating through iterative self-refinement cycles spanning problem formulation, validation, code generation and execution, and quality assessment of the optimized structure. This process enables error correction and progressive improvement of both the optimization setup and resulting design. The framework is demonstrated on optimization problems selected to cover a range of settings that differ in their literature coverage and numerical characteristics. The benefits of iterative self-refinement are found to be particularly pronounced for problem classes where the pretrained language model has limited prior exposure, such as formulations whose literature and open-source implementations are comparatively sparse. In such cases, the proposed framework reliably produces converged designs where a single state-of-the-art LLM struggles, suggesting that self-refinement broadens the range of topology optimization problems that LLM-based automation can reliably address.

2026-05-22T06:27:18Z 28 pages, 17 figures Hyunjee Park Hayoung Chung http://arxiv.org/abs/2605.23238v1 GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models 2026-05-22T05:13:45Z

Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.

2026-05-22T05:13:45Z 33 pages, 8 figures, 9 tables (4 figures, 2 tables in main paper) Vartan Shadarevian Kia Ghods Alex Kenich Anany Kotawala http://arxiv.org/abs/2605.23193v1 CultivAgents: Cultivating Relationship-Centered Multi-Agent Systems for Personalized Gardening 2026-05-22T03:20:04Z

Gardening is critical to support well-being, cultural continuity, and food autonomy, yet existing digital tools often provide generic advice that overlooks gardeners' skills, local ecologies, seasons, and cultural contexts. We introduce CultivAgents, a relationship-centered multi-agent system for personalized, socio-culturally grounded gardening support. Grounded in ethics of care, CultivAgents coordinates multiple specialized agents: an Experience Agent that adapts guidance to users' skill levels, an Environmental Agent that grounds advice in local and seasonal conditions, and an Ethnobotanical Agent that connects plants to cultural knowledge and histories. We evaluated CultivAgents through a three-phase mixed-methods study with domain experts (n=3), HCI researchers (n=7), and community gardeners (n=5), analyzing expert feedback, pre/post surveys, and participatory design activities. Results suggest that CultivAgents helped gardeners translate interest into situated action: community gardeners reported increased confidence (3.00 to 3.60), motivation (4.00 to 4.40), and trust in acting on AI advice (3.20 to 4.00). Participants valued hyperlocal ecological guidance and complementary agent perspectives, while also identifying limits in cultural specificity, ecological grounding, and agent coordination. The work advances relationship-centered AI, offering design implications for multi-agent systems that support food sovereignty, community resilience, and cultural preservation.

2026-05-22T03:20:04Z Preprint, 9 pages. Website: https://hello-diana.github.io/CultivAgents/ Yiyang Wang Moeiini Reilly Britney Johnson Kefei Yan Alex Cabral Josiah Hester http://arxiv.org/abs/2412.01524v6 Cost-Aware Distributed Online Learning with Strict Rejection Behavior against Adversarial Agents 2026-05-22T02:59:05Z

Distributed online learning in Internet of Things (IoT)-enabled multi-agent systems (MASs) is highly vulnerable to persistent adversarial interactions, particularly when malicious agents cannot be fully isolated during the transient learning stage. Existing resilient learning methods mainly focus on convergence preservation or malicious suppression, while the resulting evolution inefficiency caused by repeated corrective adaptation remains largely unexplored. To address this issue, this paper develops a cost-aware distributed online learning framework with a strict rejection behavior against adversarial agents. The proposed mechanism suppresses harmful assimilation of suspicious neighboring information and reveals a previously overlooked side effect: strict rejection may induce heterogeneous transient evolution among neighboring normal agents, leading to evolution desynchronization across the network. To mitigate this effect, a two-time-scale adaptive evolution regulation architecture is further developed, in which the outer layer dynamically adjusts the long-term evolution-rate schedule while the inner layer preserves robust online learning. Theoretical analysis establishes the dynamic tracking property of the outer-layer update and proves that the proposed regulation mechanism attenuates the propagation of strict-rejection-induced evolution desynchronization. Numerical simulations and a satellite-assisted IoT monitoring scenario demonstrate that the proposed method achieves robust and low-cost distributed online learning under persistent malicious interference.

2024-12-02T14:16:25Z 13 pages, 10 figures, 2 tables Yuhan Suo Runqi Chai Senchun Chai Xudong Zhao Yuanqing Xia http://arxiv.org/abs/2602.12316v2 GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory 2026-05-22T00:32:35Z

Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 1,535 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents fail to choose socially beneficial actions in 38% of high-stakes cases, such as military escalation, election manipulation, and medical malpractice. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.
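For readers unfamiliar with the three named game structures, the gap between individually stable and socially beneficial outcomes can be checked mechanically. The sketch below is illustrative only: the payoff numbers are textbook-style values chosen here, not taken from GT-HarmBench, and the helper names `pure_nash` and `welfare_optimum` are hypothetical.

```python
from itertools import product

# payoffs[(row_action, col_action)] = (row_payoff, col_payoff)
# Action 0 is the cooperative move (cooperate / stag / swerve), action 1 the other.
# All numbers are illustrative assumptions, not benchmark data.
GAMES = {
    "prisoners_dilemma": {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)},
    "stag_hunt": {(0, 0): (4, 4), (0, 1): (0, 3), (1, 0): (3, 0), (1, 1): (3, 3)},
    "chicken": {(0, 0): (3, 3), (0, 1): (1, 4), (1, 0): (4, 1), (1, 1): (0, 0)},
}


def pure_nash(payoffs):
    """Return action profiles where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product((0, 1), repeat=2):
        row_ok = payoffs[(a, b)][0] >= payoffs[(1 - a, b)][0]
        col_ok = payoffs[(a, b)][1] >= payoffs[(a, 1 - b)][1]
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


def welfare_optimum(payoffs):
    """Return the profile with the highest total payoff."""
    return max(payoffs, key=lambda profile: sum(payoffs[profile]))


summary = {name: (pure_nash(g), welfare_optimum(g)) for name, g in GAMES.items()}
```

With these payoffs, mutual cooperation maximizes total welfare in all three games, yet it is a pure-strategy equilibrium only in the Stag Hunt, which is why each structure stresses a different failure mode.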

2026-02-12T17:29:52Z Pepijn Cobben Xuanqiang Angelo Huang Thao Amelia Pham Isabel Dahlgren Terry Jingchen Zhang Zhijing Jin http://arxiv.org/abs/2605.23099v1 SVR-MAD: A Bayesian-Inspired Framework for Posterior-Guided Multi-Agent Debate 2026-05-21T23:17:03Z

Multi-Agent Debate (MAD) improves LLM-agent accuracy but suffers from rapid context growth, limiting scalability in larger multi-agent settings. Existing methods prune low-utility communications using prior signals, such as token-level log-likelihoods or LLM self-reported confidence. However, these signals become unreliable under hallucination, degrading the accuracy of MAD methods that rely on them. We propose SVR-MAD, a Bayesian-inspired MAD framework that treats pre-debate signals as priors and debate outcomes as posterior-style evidence for estimating agent correctness. SVR-MAD uses this evidence to incrementally construct the communication graph, prioritizing agents whose answers survive peer challenges. Experiments across multiple LLMs and benchmarks show that SVR-MAD reduces token cost by up to 61% while matching or improving accuracy relative to the most accurate competing MAD baseline.
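The prior-to-posterior idea can be sketched with a Beta-Bernoulli update. This is a swapped-in textbook technique used for illustration, not the estimator from the paper: the class `AgentBelief`, the `strength` pseudo-count, and the ranking helper are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class AgentBelief:
    """Beta(alpha, beta) belief that an agent's answer is correct (illustrative)."""
    alpha: float
    beta: float

    @classmethod
    def from_prior(cls, confidence: float, strength: float = 2.0) -> "AgentBelief":
        # Treat a pre-debate signal (e.g. self-reported confidence) as pseudo-counts.
        return cls(alpha=confidence * strength, beta=(1.0 - confidence) * strength)

    def update(self, survived_challenge: bool) -> None:
        # Each peer challenge is one Bernoulli observation.
        if survived_challenge:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def rank_agents(beliefs: dict) -> list:
    """Order agent names by posterior mean, highest first."""
    return sorted(beliefs, key=lambda name: beliefs[name].mean, reverse=True)


beliefs = {
    "overconfident": AgentBelief.from_prior(0.9),
    "modest": AgentBelief.from_prior(0.5),
}
for _ in range(3):
    beliefs["overconfident"].update(survived_challenge=False)
    beliefs["modest"].update(survived_challenge=True)
ranking = rank_agents(beliefs)
```

The example shows the failure the abstract describes: a high prior from a hallucinating agent is overridden once its answers repeatedly fail peer challenges.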

2026-05-21T23:17:03Z Weifan Jiang Rana Shahout Minghao Li Zhenting Qi Yilun Du Michael Mitzenmacher Minlan Yu http://arxiv.org/abs/2605.23023v1 How to Steer Your Multi-Agent System: Human-LLM Collaborative Planning 2026-05-21T20:47:18Z

In orchestrated multi-agent systems, humans often struggle to manage plans due to their complexity and limited transparency. Existing approaches rely on outcome-level supervision, where users verify only final outputs without visibility into intermediate reasoning. We formalize a design space for human-LLM co-planning interactions along three axes: mode (semantic vs. structural), scope (global vs. targeted), and level (low vs. high-level edits). We realize it in AMBIPOM, a prototype supporting process-level supervision through both semantic and structural interactions. Through a user study, we characterize how users navigate this space, revealing hybrid workflows and effort-control-risk trade-offs; through a controlled benchmark, we analyze how LLMs revise plans under varying scope and revision strategies. Our findings yield design insights for more transparent, controllable, and effective human-AI co-planning. We release code and data at https://github.com/megagonlabs/ambipom.

2026-05-21T20:47:18Z ACM Conference on AI and Agentic Systems (CAIS) 2026 Zeyu He Hannah Kim Dan Zhang Estevam Hruschka 10.1145/3786335.3813144 http://arxiv.org/abs/2605.06936v2 Bridging the Last Mile of Circuit Design: PostEDA-Bench, a Hierarchical Benchmark for PPA Convergence and DRC Fixing 2026-05-21T20:26:07Z

LLM-based agents are increasingly applied to the "last mile" of Electronic Design Automation (EDA): repairing residual sign-off Design Rule Check (DRC) violations and converging Power-Performance-Area (PPA) targets after tool runs. Existing EDA-LLM benchmarks, however, omit DRC fixing entirely and rely on flat hierarchies tied to a single toolchain. We introduce PostEDA-Bench, a hierarchical benchmark with 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi, supported by EDA toolchains with machine-checkable evaluation. Across eight commercial and open-source LLMs under multiple agent scaffolds, we find that agents handle synthetic DRC-Essential and single-objective PPA-Mono reasonably well but degrade sharply on the more practical DRC-Reasoning, where the best success rate is 36.66%, and PPA-Multi, where the best success rate is 20.00%; vision augmentation consistently enhances DRC-Bench; and trade-off reasoning, rather than knob knowledge, is the dominant PPA-Multi bottleneck.

2026-05-07T20:54:07Z Pengju Liu Nuo Xu Jinwei Tang Yu Cao Caiwen Ding http://arxiv.org/abs/2601.22324v2 Automatic Construction of Clinical Scoring Systems with LLM Agents 2026-05-21T19:33:14Z

Modern clinical practice relies on evidence-based guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce AgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
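A unit-weighted checklist of the kind described above is simple to state in code: each binary rule contributes one point, and the total is compared against a threshold. The sketch below assumes hypothetical rules, field names, and a threshold chosen purely for illustration; none of them come from the paper or from any clinical guideline.

```python
from typing import Callable, Mapping

Rule = Callable[[Mapping[str, float]], bool]

# Hypothetical binary rules; names and cut-offs are illustrative only.
RULES: dict[str, Rule] = {
    "age >= 65": lambda p: p["age"] >= 65,
    "heart_rate > 100": lambda p: p["heart_rate"] > 100,
    "systolic_bp < 90": lambda p: p["systolic_bp"] < 90,
}


def checklist_score(patient: Mapping[str, float], rules: Mapping[str, Rule] = RULES) -> int:
    """Count how many binary rules fire; every rule has weight 1."""
    return sum(1 for rule in rules.values() if rule(patient))


def flag(patient: Mapping[str, float], threshold: int = 2,
         rules: Mapping[str, Rule] = RULES) -> bool:
    """Predict the positive class when the score reaches the threshold."""
    return checklist_score(patient, rules) >= threshold


high_risk = {"age": 70, "heart_rate": 110, "systolic_bp": 120}
low_risk = {"age": 40, "heart_rate": 80, "systolic_bp": 120}
scores = (checklist_score(high_risk), checklist_score(low_risk))
```

Choosing which rules enter the checklist is the hard part: with k candidate rules there are 2^k possible subsets, which is the exponentially large discrete space the abstract refers to.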

2026-01-29T21:11:06Z Silas Ruhrberg Estévez Christopher Chiu Mihaela van der Schaar http://arxiv.org/abs/2601.14652v5 MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks 2026-05-21T18:10:39Z

While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity - agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity - and (2) efficacy uncertainty - MAS are deployed without understanding if there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.

2026-01-21T04:57:02Z ICML 2026 Zixuan Ke Yifei Ming Austin Xu Ryan Chin Xuan-Phi Nguyen Prathyusha Jwalapuram Jiayu Wang Semih Yavuz Caiming Xiong Shafiq Joty