https://arxiv.org/api/MeGZj3dP5Om/dX/VwyN9bX8qJ5U
2026-06-25T19:03:13Z
1275090015

http://arxiv.org/abs/2605.05657v1
Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation
2026-05-07T04:18:53Z
Multi-agent LLM systems for code generation face a fundamental routing problem: the optimal orchestration topology depends on the structural complexity of the code under modification, yet existing systems select topologies without consulting the codebase. We present Retrieval-Guided Adaptive Orchestration (RGAO), an architecture that closes this loop by extracting a structural complexity vector from a hierarchical code index before selecting the orchestration topology. RGAO operates within Code-Agent, a multi-agent framework whose sub-agents are governed by formal contracts with six-dimensional budget vectors. Our headline contribution is the composition of two previously separate lines of work -- complexity-conditioned LLM routing and formal resource algebras -- yielding a property neither admits alone: provable budget conservation under retrieval-conditioned dynamic topology selection. Concretely we contribute: (1) a complexity-conditioned topology router that reduces proxy-measured misrouting from 30.1% to 8.2%; (2) a budget algebra with a structural-induction conservation theorem; and (3) a hierarchical code retrieval engine. Empirical evaluation demonstrates sub-millisecond DAG construction and linear tree-index scalability.
2026-05-07T04:18:53Z
30 pages, 9 figures. NeurIPS 2026 Evaluations and Datasets Track Submission Under review
Abhijit Talluri, Pujith Anne, Bhagavan Choudary Pendiyala, Raghavendra Chilukuri

http://arxiv.org/abs/2605.05482v1
FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking
2026-05-06T22:04:44Z
Large language models (LLMs) are rapidly being adopted across various domains.
However, their adoption in the banking industry faces resistance due to demands for high accuracy, regulatory compliance, and the need for verifiable and grounded responses. We present a unified, data-efficient framework for training grounded domain-specific LLMs that optimizes answer quality, citation grounding, and calibrated refusal under real-world deployment constraints. First, we describe a data generation pipeline that combines LLM-as-a-Judge filtering, citation annotation, and curriculum learning with only 143M tokens. The resulting 12B model achieves high answer quality, outperforming GPT-4.1 on citation grounding, with a modest citation tradeoff versus the untuned base. Second, we propose a calibrated refusal mechanism: training on 22% unanswerable examples yields a 12% "I don't know" rate, substantially improving over the base model's unsafe 4.3% rate while avoiding GPT-4.1's over-refusal (20.2%). Third, we present an end-to-end methodology spanning from data curation to quantized serving. The system is deployed at 40+ financial institutions, achieving a 7.1 percentage point improvement in query resolution (p < 0.001). Additionally, the model delivers 3-5x faster responses at 20-50x lower cost compared to GPT-4.1.
2026-05-06T22:04:44Z
7 pages, ACL 2026 conference
Denys Katerenchuk, Pablo Duboue, Keelan Evanini, David Gondek, Nithin Govindugari, Olivier Allauzen, Joshua Baptiste, David J More, Joshua Schechter

http://arxiv.org/abs/2605.05020v1
Graph-SND: Sparse Aggregation for Behavioral Diversity in Multi-Agent Reinforcement Learning
2026-05-06T15:18:42Z
System Neural Diversity (SND) measures behavioral heterogeneity in multi-agent reinforcement learning by averaging pairwise distances over all $\binom{n}{2}$ agent pairs, making each call quadratic in team size. We introduce Graph-SND, which replaces this complete-graph average with a weighted average over the edges of an arbitrary graph $G$.
Three regimes follow: $G=K_n$ recovers SND exactly; a fixed sparse $G$ defines a localized diversity measure at $O(|E|)$ cost; and random edge samples yield an unbiased Horvitz-Thompson estimator and a normalized sample mean with $O(1/\sqrt{m})$ concentration in the sampled edge count $m$. For fixed sparse graphs we prove forwarding-index distortion bounds for expanders and a spectral refinement under low-rank distance structure; for random $d$-regular graphs we prove an unconditional probabilistic $\widetilde{\mathcal{O}}(D_{\max}/\sqrt{n})$ bound. On VMAS we verify recovery, unbiasedness, concentration, and wall-clock scaling, with a PettingZoo TVD panel checking non-Gaussian transfer. In a 500-iteration $n=100$ PPO run, Bernoulli-$0.1$ Graph-SND tracks full SND while reducing per-call metric time by about $10\times$, and frozen-policy GPU timing up to $n=500$ follows the predicted $\binom{n}{2}/|E|$ speedup. Random $d$-regular expanders empirically achieve $\mathrm{SND}_{G}^{\mathrm{u}}/\mathrm{SND} \in [0.9987, 1.0013]$ at $\Theta(n \log n)$ edges. In DiCo diversity control at $n=50$, Bernoulli-$0.1$ Graph-SND preserves set-point tracking with paired reward differences indistinguishable from zero across nine matched cells while cutting per-call metric cost by ${\sim}9.5\times$. Together, these results show that the SND aggregation bottleneck can be removed without changing the metric's semantics, yielding a drop-in sparse alternative that scales beyond complete-graph SND and supports both passive measurement and closed-loop diversity control.
2026-05-06T15:18:42Z
22 pages, 12 figures, 7 tables
Shawn Ray

http://arxiv.org/abs/2605.23949v1
SODE: Analyzing Social Dynamics in LLM Agents
2026-05-06T14:50:07Z
As Large Language Models (LLMs) evolve into interactive agents, understanding their behavioral alignment within human social dynamics becomes essential.
While behavioral game theory offers a framework to study these interactions, previous work has predominantly relied on outcome-based metrics such as average scores. This focus overlooks the mechanisms that facilitate sustainable cooperation, as identical scores can be derived from vastly different strategies. To bridge this gap, we introduce SODE (Social Dynamics Evaluation), a framework that evaluates LLM agents across three evolutionary dimensions: Direct Reciprocity for strategy adaptation, Indirect Reciprocity for reputation sensitivity, and Group Dynamics for cooperative resilience. Applying SODE reveals systematic divergences: instruction-tuned models often exhibit "passive compliance" that renders them vulnerable to exploitation, while reasoning models prioritize short-horizon optimization, destabilizing long-term cooperation. Notably, we demonstrate that a "long-horizon framing" can unlock reciprocal capabilities in reasoning models. Thus, SODE offers a systematic, mechanism-grounded benchmark for aligning AI agents with complex human social dynamics.
2026-05-06T14:50:07Z
Inseo Jung, Yoonseok Oh, Kyungryul Back, Jinkyu Kim, Jungbeom Lee

http://arxiv.org/abs/2605.04922v1
Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation
2026-05-06T13:50:40Z
LLM-empowered multi-agent systems offer new potential to accelerate scientific discovery by generating novel research ideas. However, existing methods typically coordinate agents through temporary texts, such as drafts or chat logs, which makes it difficult to pinpoint the weaknesses in the generated ideas and how the agents refine them. To this end, we introduce Evolving Idea Graphs (EIG), a graph-based multi-agent scientific ideation framework that can generate high-performance research ideas across various benchmark-native metrics, such as novelty, feasibility, and clarity.
Instead of coordinating solely through texts, EIG represents a partially formed proposal as an evolving idea graph, where nodes capture scientific claims and edges encode relations (e.g., support and conflict), enabling unresolved weaknesses to remain identifiable throughout the idea-evolving process. Specifically, a learned two-head controller operates over the evolving graph to guide the ideation: one head selects graph edits for agents to execute, while the other decides when the graph is ready to commit for final proposal synthesis. On AI Idea Bench 2025 and LiveIdeaBench, EIG outperforms all compared systems on both automatic benchmark scores and blind expert ratings. Ablations further show that explicit graph state provides the main performance gains, and learned edit-and-commit control adds consistent improvements.
2026-05-06T13:50:40Z
Jiangwen Dong, Bo Li, Wanyu Lin

http://arxiv.org/abs/2605.23948v1
comokit4py : a python package to ease COMOKIT agent based model simulation integration into a high performance computing workflow
2026-05-06T13:32:00Z
Agent-based models (ABMs) are a kind of computer model that makes it possible to simulate a set of autonomous interacting programs, called agents, in a shared virtual environment. Among other application fields, they have commonly been used to simulate social phenomena such as urban segregation, opinion dynamics, or epidemiological crises [1]. Recently, research emphasis has been put on ABMs to study in silico the impact of non-pharmaceutical interventions to mitigate the SARS-CoV-2 outbreak of 2020, a few of which had a great impact on global political responses [2]. Among the models used, COMOKIT [3] was designed to simulate the everyday life of inhabitants of various cities in Vietnam and to test policy interventions for various COVID-19 spread scenarios. Such an endeavor required huge computational power to handle a large number of simulation replications over a large set of parameters.
In this proposal, we present a Python package that makes it easy to generate, explore, and build reports for any COMOKIT experiment to be launched on High-Performance Computing (HPC) infrastructure.
2026-05-06T13:32:00Z
6 pages, 2 figures, IEEE submission
Arthur Brugière, Kévin Chapuis

http://arxiv.org/abs/2605.04811v1
Tree-based Credit Assignment for Multi-Agent Memory System
2026-05-06T12:02:59Z
Memory systems are widely adopted to enhance LLMs for long-horizon tasks, and are commonly organized as multi-agent pipelines with memory building, summarizing, and retrieval agents. To empower this system, existing RL-based methods either apply final downstream task rewards (e.g., QA accuracy) for all agents uniformly, which are coarse and ambiguous, or design task-specific rewards for agents on different subtasks, which require costly annotations (e.g., key evidence) and are difficult to define reliably. To address these limitations, we propose Tree-based Credit Assignment for Multi-Agent Memory Systems (TreeMem), which derives agent-specific credit from the final reward without task-specific annotations. Specifically, TreeMem extends the multi-agent pipeline (builder--summarizer--retrieval) into a tree structure, where each agent's outputs are expanded into multiple subsequent branches. The contribution of each agent is estimated via Monte Carlo averaging over its subsequent branches, capturing how intermediate agent actions may influence the final reward. This converts the coarse final reward into agent-specific optimization signals. These signals are then used to update all agent policies simultaneously, helping heterogeneous agents specialize effectively.
Experiments on long-horizon benchmarks show that TreeMem improves memory system performance over strong baselines, validating the effectiveness of tree-structured credit assignment for the multi-agent memory system.
2026-05-06T12:02:59Z
Marina Mao, Alexandr Liu, Pengbo Li, Siheng Li, Bo Zhou, Xiang Wang

http://arxiv.org/abs/2605.04777v1
Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents
2026-05-06T11:30:21Z
Autonomous Earth Observation (EO) agents are transitioning from passive perception to complex, multi-step task execution. However, current architectures that integrate planning and execution within a single model often struggle with combinatorial complexity and reasoning errors in dynamic EO scenarios. To resolve these challenges, we propose the Lightweight Multimodal Meta-Planner (LMMP) framework. LMMP incorporates a dual-awareness mechanism that grounds strategic plans in both multimodal image features and high-level task semantics. Crucially, we introduce a Meta Task Library to inject remote sensing expert knowledge directly into the workflow, which standardizes domain logic and ensures plans are physically feasible. We further implement a two-stage training pipeline, initializing the Meta-Planner via expert-distilled Supervised Fine-Tuning and refining it through Direct Preference Optimization based on execution feedback. Extensive experiments on a dataset derived from EarthBench and ThinkGeo demonstrate that LMMP significantly improves tool-calling accuracy and task success rates.
Moreover, the framework exhibits strong "plug-and-play" versatility, consistently enhancing the performance of diverse executor backbones across previously unseen EO missions.
2026-05-06T11:30:21Z
Jinghui Xu, Boyi Shangguan, Mengke Zhu, Hao Liu, Junhuan Jiang, Guangjun He, Pengming Feng, Shichao Jin, Bin Liang, Yongzhe Chang, Junbo Tan, Tiantian Zhang, Xueqian Wang

http://arxiv.org/abs/2604.11840v2
When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation
2026-05-06T09:29:55Z
Behavioral simulation and strategic problem solving are different tasks. Large language models are increasingly explored as agents in policy-facing institutional simulations, but stronger reasoning need not improve behavioral sampling. We study this solver-sampler mismatch in three multi-agent negotiation environments: two trading-limits scenarios with different authority structures and a grid-curtailment case in emergency electricity management. Across two primary model families, native reasoning and often no reflection collapse toward authority-heavy outcomes. The sharpest case is DeepSeek native reasoning in the grid-curtailment transfer: it reaches action entropy 1.256 and a concession-arc rate of 0.933, yet still ends in authority decision in 15 of 15 runs. A direct OpenAI extension shows the same pressure at provider breadth: GPT-5.2 native reasoning ends in authority decisions in 45 of 45 runs across the three environments. Budget-matched no-reflection controls and orthogonal private-state controls remain rigid, while the negotiation-structured scaffold condition is the only condition that consistently opens negotiated outcomes. These diagnostics are failure screens within a fixed negotiation grammar, not evidence of external behavioral realism or policy-forecasting validity. The results show that neither more output space nor generic extra private state rescues solver-like sampler failure.
For institutional simulation, solver strength and sampler qualification are different objectives: models should be evaluated for the behavioral role they are meant to play, not only for strategic capability.
2026-04-12T13:36:10Z
12 pages, 7 figures, supplementary material included as ancillary file
Sandro Andric

http://arxiv.org/abs/2605.02463v2
When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems
2026-05-06T08:34:04Z
Multi-agent LLM systems are increasingly used to solve complex tasks through decomposition, debate, specialization, and ensemble reasoning. However, these systems are usually evaluated in terms of robustness: whether performance is preserved under perturbation. This paper studies a different question: whether semantic stress exposes structured variation that could support future antifragile learning. We introduce CAFE (Cognitive Antifragility Framework for Evaluation), a statistical framework for detecting antifragility-compatible regimes in multi-agent architectures. CAFE models a controlled expected distribution of semantic stressors, reconstructs an architecture-specific observed effective stress distribution from multi-dimensional judge signals, and compares both distributions using a distributional Jensen Gap under a convex stress potential. A positive gap does not imply immediate performance improvement; instead, it indicates a convex-expansive deformation of the observed stress distribution, suggesting that the architecture exposes learnable stress structure. We evaluate CAFE on a banking-risk analysis benchmark with five multi-agent architectures: flat, hierarchical, debate, meta-adaptive, and ensemble. Across all architectures, semantic stress reduces average judged quality by roughly one third. Yet all architectures exhibit positive distributional Jensen Gaps with bootstrap confidence intervals above zero.
These results show that immediate quality degradation can coexist with statistically detectable antifragility-compatible stress geometry. CAFE is therefore not an antifragile learner itself, but a measurement layer for identifying when and where antifragility learning may be worth applying.
2026-05-04T11:06:19Z
Jose Manuel de la Chica, Juan Manuel Vera, Jairo Rodríguez

http://arxiv.org/abs/2605.04637v1
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies
2026-05-06T08:30:37Z
The emergence of "vibe coding" platforms, where users describe applications in natural language and AI agents autonomously generate full-stack software, has created a need for rigorous evaluation beyond code-level benchmarks. In order to assess them as virtual software development agencies on understanding business requirements, making architectural decisions, writing production code, handling iterative modifications, and maintaining business readiness, we introduce SWE-WebDev Bench, a 68-metric evaluation framework spanning 25 primary and 43 diagnostic metrics across seven groups, organized along three dimensions: Interaction Mode (App Creation Request (ACR) vs. App Modification Request (AMR)), Agency Angle (Product Manager (PM), Engineering, Ops), and Complexity Tier (T4 multi-role SaaS, T5 AI-native).
Our evaluation (six platforms, three domains, 18 evaluation cells) reveals four recurring shortcomings in the current generation of AI app builders: (1) a specification bottleneck, where platforms compress rich business requirements into oversimplified technical plans; (2) a pervasive frontend-backend decoupling, where visually polished UIs mask absent or broken backend infrastructure; (3) a steep production-readiness cliff, where no platform scores above 60% on engineering quality and post-generation human effort varies substantially across platforms; and (4) widespread security and infrastructure failures, with no platform exceeding a 65% Security Score against a 90% target and concurrency handling as low as 6%. These observations are descriptive of our sample and require larger-scale replication to establish generality. We release SWE-WebDev Bench as a community benchmark to enable such replication and help platform builders identify and address these gaps.
Code and benchmark resources are available at: https://github.com/snowmountainAi/webdevbench and https://webdevbench.com/.
2026-05-06T08:30:37Z
35 pages, 12 figures, 18 tables
Siddhant Saxena, Nilesh Trivedi, Vinayaka Jyothi

http://arxiv.org/abs/2605.04627v1
Autonomous Synchronization of Discrete-Time Heterogeneous Multiagent Systems
2026-05-06T08:15:38Z
This paper investigates the autonomous synchronization problem for discrete-time heterogeneous multiagent systems.
The synchronization problem is transformed into the asymptotic decoupling problem of stable modes in a class of discrete-time linear time-varying systems,
for which we provide a sufficient condition.
Leveraging this condition, synchronization conditions are established.
The synchronization conditions are based on the average of the agents' initial dynamic matrices,
without requiring the differences among these matrices to be small.
This approach reduces the conservativeness of existing conditions and achieves a unification of both homogeneous and heterogeneous systems.
Numerical simulation results are provided to support the theoretical findings.
2026-05-06T08:15:38Z
9 pages, 7 figures, submitted to IEEE Transactions on Control of Network Systems
Wei Hu, Quanyi Liang

http://arxiv.org/abs/2605.04528v1
YOTOnet: Zero-Shot Cross-Domain Fault Diagnosis via Domain-Conditioned Mixture of Experts
2026-05-06T06:12:21Z
Mechanical equipment forms the critical backbone of modern industrial production, yet domain shift severely limits the generalization of deep-learning-based fault diagnosis models across different equipment and operating conditions. Inspired by the success of foundation models in achieving zero-shot generalization, we propose YOTOnet (You Only Train Once), a novel architecture specifically designed for cross-domain fault diagnosis in mechanical equipment. YOTOnet comprises three core components: (1) a physics-aware Invariant Feature Distiller that extracts domain-agnostic representations using multi-scale dilated convolutions and FFT-based time-frequency fusion, (2) Domain-Conditioned Sparse Experts (DC-MoE) that adaptively route inputs to specialized processors via learned gating without external meta-data, and (3) a dual-head classification system with auxiliary supervision. Extensive validation on five public bearing datasets (CWRU, MFPT, XJTU, OTTAWA, HUST) through 30 cross-dataset protocols demonstrates the superiority of YOTOnet compared with other state-of-the-art methods. Critically, we observe a clear scaling effect: average test F1 improves from 0.5339 (1 training dataset) to 0.705 (4 datasets), with a clear gain when moving from 3 to 4 datasets.
These findings provide empirical evidence that foundation model principles can enable robust, train-once deployment for industrial fault diagnosis.
2026-05-06T06:12:21Z
Zesen Wang, Zihao Wu, Yue Hu, Yang Gao, Fuzhen Xuan

http://arxiv.org/abs/2605.04522v1
DAO-enabled decentralized physical AI: A new paradigm for human-machine collaboration
2026-05-06T06:04:30Z
We propose DAO-enabled decentralized physical AI (DePAI), a democratic architecture for coordinating humans and autonomous machines in the operation and governance of physical-digital systems. We (1) synthesize foundations in blockchains, decentralized autonomous organizations (DAOs), and cryptoeconomics; (2) connect DAO design with digital-democracy research on deliberation and voting, showing how each can advance the other; (3) position DAO-governed decentralized physical infrastructure networks (DePIN) within a vertically integrated stack that links energy and sensing to connectivity, storage/compute, models, and robots; (4) show how these elements specify workflows that couple machine execution with human oversight, enabling enhanced self-organization of techno-socio-economic systems, which we call DePAI; and (5) analyze risks, including security, centralization, incentive failure, legal exposure, and the crowding-out of intrinsic motivation, and argue for value-sensitive design and continuously adaptive governance. DePAI offers a path to scalable, resilient self-organization that integrates physical infrastructure, AI, and community ownership under transparent rules, on-chain incentives, and permissionless participation, aiming to preserve human autonomy.
2026-05-06T06:04:30Z
Mark C. Ballandies, Florian Spychiger, Uwe Serdült, Claudio J. Tessone

http://arxiv.org/abs/2602.04129v2
KGLAMP: Knowledge Graph-guided Language model for Adaptive Multi-robot Planning and Replanning
2026-05-06T00:09:55Z
Heterogeneous multi-robot systems are increasingly used in long-horizon missions requiring coordinated planning across diverse capabilities.
However, existing planning approaches struggle to construct accurate symbolic representations and maintain plan consistency in dynamic environments. Classical PDDL planners require manually crafted symbolic models, while LLM-based planners often ignore agent heterogeneity and environmental uncertainty. We introduce KGLAMP, a knowledge-graph-guided LLM planning framework for heterogeneous multi-robot teams. The framework maintains a structured knowledge graph encoding object relations, spatial reachability, and robot capabilities, which guides the LLM in generating accurate PDDL problem specifications. The knowledge graph serves as a persistent, dynamically updated memory that incorporates new observations and triggers replanning upon detecting inconsistencies, enabling symbolic plans to adapt to evolving world states. Experiments on the MAT-THOR benchmark show that KGLAMP improves performance by at least 25.3% over both LLM-only and PDDL-based variants.
2026-02-04T01:46:02Z
Chak Lam Shek, Faizan M. Tariq, Sangjae Bae, David Isele, Piyush Gupta