LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

2026-05-31T23:15:40Z

We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\times2\times2$ factorial design (Authority $\times$ Roles $\times$ Dynamics), we conducted 520 experimental runs across 8 design tasks of varying complexity, with 5 repetitions each. Designs were evaluated on a 12-dimensional rubric by three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, Claude Sonnet 4.6). We report four core findings. First, structural adversarial (v4b) ranks #1 by ensemble -- a prompt-engineered adversarial variant that demands rewrite mandates rather than patches (weighted ensemble: 4.637/5.0). Second, cross-model review wins unanimously at #2 -- generate with one model, review with another -- ranking #2 by all three evaluators (weighted ensemble: 4.606). Third, evaluator diversity is itself a finding -- all three evaluators agree v4b is best and v3 is worst, but disagree sharply on v2b (Claude d=1.44 vs. GPT-OSS d=0.45), revealing how different model families weight design qualities. Fourth, parallel merge is fundamentally broken -- all three evaluators place merge variants in the bottom tier (3.65-3.79), due to token starvation and the Frankenstein effect. The weighted ensemble ($2\times$Opus + $2\times$Sonnet + $1\times$GPT-OSS) provides robust rankings across 520 runs, confirmed through independent cross-validation.
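
The weighted ensemble described above ($2\times$Opus + $2\times$Sonnet + $1\times$GPT-OSS) can be read as a weighted mean of per-evaluator rubric scores. The following is a minimal sketch under that assumption; the function name, the dictionary keys, and the example scores are hypothetical and are not taken from the paper.

```python
# Weights as stated in the abstract: 2x Opus, 2x Sonnet, 1x GPT-OSS.
# The dictionary keys are hypothetical labels chosen for this sketch.
WEIGHTS = {"opus": 2.0, "sonnet": 2.0, "gpt_oss": 1.0}


def weighted_ensemble(scores, weights=WEIGHTS):
    """Return the weighted mean of per-evaluator scores (5-point scale)."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing evaluator scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Made-up scores for one design, for illustration only.
example_scores = {"opus": 4.7, "sonnet": 4.6, "gpt_oss": 4.5}
print(round(weighted_ensemble(example_scores), 3))
```

With these weights the two Claude evaluators together contribute four fifths of the ensemble score, so a disagreement between model families (such as the one reported for v2b) is resolved mostly in their favor.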

Genotype-Conditioned Molecular Generation via Evidence-Grounded Multi-Objective Latent Perturbation in Diffusion Models

2026-05-31T21:43:11Z

Developing effective anticancer therapeutics remains challenging due to tumor heterogeneity and the absence of well-defined molecular targets across cancer subtypes. Generative models conditioned on cancer genotypes offer a promising avenue for personalized drug discovery, yet existing approaches lack explicit optimization for simultaneous sensitivity, synthesizability, and mechanistic binding plausibility. We present a latent-space optimization approach for a pretrained genotype-to-drug diffusion model, introducing a learnable perturbation over the molecular latent space optimized via gradient ascent to maximize a composite reward combining predicted drug sensitivity (AUC), drug-likeness (QED), and synthetic accessibility (SAS). Critically, biological realism is enforced by grounding both reward design and evaluation in experimentally derived cancer cell line data and validated pharmacologic signals, anchoring candidate generation in real-world clinical evidence. Mechanistic plausibility is further assessed by a multi-agent LLM pipeline grounded in the diffusion model's attention mechanism. Experiments across 15 cancer cell lines from three held-out evaluation sets demonstrate consistent and noticeable improvements over competing baselines in sensitivity, drug-likeness, synthesizability, and chemical validity.

CrazyMARL: Decentralized Direct Motor Control Policies for Cooperative Aerial Transport of Cable-Suspended Payloads

2026-05-31T20:37:28Z

Collaborative transportation of cable-suspended payloads by teams of UAVs has the potential to enhance payload capacity, adapt to different payload shapes, and provide built-in compliance, making it attractive for applications ranging from disaster relief to precision logistics. However, multi-UAV coordination under disturbances, nonlinear payload dynamics, and slack-taut cable modes remains a challenging control problem. To our knowledge, no prior work has addressed these cable mode transitions in the multi-UAV context, instead relying on simplifying rigid-link assumptions. We propose CrazyMARL, a decentralized RL framework for multi-UAV cable-suspended payload transport. Simulation results demonstrate that the learned policies can outperform classical decentralized controllers in terms of disturbance rejection and tracking precision, achieving an 80% recovery rate from harsh conditions compared to 44% for the baseline method. We also achieve successful zero-shot sim-to-real transfer and demonstrate that our policies are highly robust under harsh conditions, including wind, random external disturbances, and transitions between slack and taut cable dynamics. This work paves the way for autonomous, resilient UAV teams capable of executing complex payload missions in unstructured environments. Code and videos can be found on the website: https://imrclab.github.io/CrazyMARL.

The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

2026-05-31T16:19:54Z

Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a two-parameter scaling law $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-β})$ where the regime exponent $β$ classifies any configuration into one of three asymptotic regimes -- hard-ceiling at $1/c$ ($β = 0$), sublinear at $N^β/c$ ($0 < β < 1$), or linear ($β \ge 1$) -- and a mean-field theorem predicts that peer count $k$ and rounds $τ$ during agent debate enter the dynamics only through their product $kτ$. The law applies at two levels: answer diversity and correctness redundancy. Across 44 (model $\times$ task $\times$ condition) cells spanning peer debate, self-correction, random-noise placebo, self-consistency, three open-weight families (Qwen, Llama, Ministral) at scales from 7B to 32B with a frontier API check (Gemini), thinking models, heterogeneous teams, and sparse communication, the functional form fits every condition at $R^2 > 0.99$; only $(c, β)$ shifts. On free-form math, dense peer influence collapses the answer-level regime from sublinear into hard-ceiling; correctness-level fits remain hard-ceiling throughout. Three findings have practical implications. \emph{(i)}~Thirty dense debating agents produce no more answer diversity than one on MMLU-Hard. \emph{(ii)}~A noise placebo tracks self-correction on free-form math and at $4\times$ scale, so within homogeneous teams the gain commonly attributed to ``debate'' comes from re-evaluation, not peer content. \emph{(iii)}~A single $N \le 5$ pilot predicts the $N=30$ structural ceiling, and within the configurations tested only architectural diversity (heterogeneous teams) lowers $c$ and escapes the hard-ceiling regime; communication-mode interventions do not.
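
The regimes of the scaling law can be checked numerically by evaluating $R(N)$ and $N_\text{eff} = N \cdot R(N)$ directly. This is a minimal sketch of the formula as stated in the abstract; the function names and the values of $c$ and $β$ are illustrative assumptions, not fitted parameters from the paper.

```python
def efficiency(n, c, beta):
    """R(N) = 1 / (1 + c * (N - 1) * N**(-beta)), as given in the abstract."""
    return 1.0 / (1.0 + c * (n - 1) * n ** (-beta))


def effective_team_size(n, c, beta):
    """N_eff = N * R(N): the number of effectively independent agents."""
    return n * efficiency(n, c, beta)


# Illustrative parameters only (c and beta are assumptions, not fitted values).
for n in (1, 5, 30, 1000):
    hard = effective_team_size(n, c=0.5, beta=0.0)   # approaches 1/c = 2
    sub = effective_team_size(n, c=0.5, beta=0.5)    # grows like N**0.5 / c
    print(f"N={n:5d}  hard-ceiling N_eff={hard:6.3f}  sublinear N_eff={sub:8.3f}")
```

With $β = 0$ and $c = 0.5$ the effective team size stays below $1/c = 2$ however many agents are added, whereas with $β = 0.5$ it keeps growing, roughly as $\sqrt{N}/c$ for large $N$.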

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories

2026-05-31T16:00:10Z

Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually update skills from full trajectories or session-level feedback, which makes failure attribution coarse and often produces unstable or overly broad revisions. We propose SkillAdaptor, a training-free, step-level skill adaptation framework with explicit failure attribution that can plug into OpenClaw-class agent harnesses. Given a failed trajectory, SkillAdaptor identifies a first actionable fault step, links responsibility to candidate skills, and applies targeted updates under explicit acceptance checks while keeping the backbone frozen. We evaluate on WebShop, PinchBench, and Claw-Eval with Kimi-K2.5, GLM-5, and GPT-5.2. SkillAdaptor improves over no-skill and skill-adaptation baselines on all three suites, with the largest single-metric improvements of +1.5 points on PinchBench Avg Score%, +1.8 on Claw-Eval Avg Score, and +1.7 on WebShop success rate. These results indicate that step-level attribution supports more stable and auditable training-free skill maintenance. The code will be released at https://github.com/zjunlp/SkillAdaptor.

Crayotter: Traceable Multi-Agent Workflows for Long-Form Video Editing

2026-05-31T14:07:57Z

Editing a long-form video from heterogeneous footage requires more than selecting clips: an agent must preserve narrative intent across material preparation, timeline construction, post-production, and revision while leaving enough evidence to diagnose failures. We present \textbf{Crayotter}, an open-source multimodal multi-agent system for prompt-driven video editing. Crayotter organizes production into three phases: coverage-aware material preparation, artifact-based editing research, and tool-grounded timeline execution. Each phase externalizes inspectable artifacts, including coverage reports, multimodal analyses, editing blueprints, tool calls, and intermediate renders. These artifacts make an editing run traceable and allow failed segments to be diagnosed and selectively revised instead of requiring a full restart. We evaluate Crayotter on 23 editing themes against CapCut-Mate and CutClaw. Under human evaluation, Crayotter achieves an average score of 3.40/5, compared with 2.44 and 1.70 for the two baselines, with consistent gains in theme alignment, narrative coherence, and editing smoothness. We additionally describe a replayable trajectory schema and verifiable reward design that prepare these workflows for future policy optimization. Code, traces, and examples are publicly available at https://github.com/idwts/Crayotter.

Coordinating Task Switching in a Robotics Multi-Agent System Using Behavior Trees

2026-05-31T11:22:16Z

The application of multi-agent systems in robotics is a very challenging field. Several competitions involving such systems have been proposed to foster research and development of strategies and mechanisms using games as the underlying domain. Among them are those of the \textit{IEEE Very Small Size Soccer (VSSS)} category, which is the case study described in this paper. In VSSS, two teams of three robots each compete in the highly dynamic environment of a soccer game. Thus, coordinating the robots' behavior during the game is crucial to winning it. In this paper, we present a Behavior-Tree-based approach to support multi-robot coordination within the VSSS team of the ThundeRatz robotics team from the Universidade de São Paulo. Moreover, the proposed approach was compared with the previous one, which was based on a Finite State Machine (FSM), using the FIRASim simulator. In addition, the performance of the new strategy was further evaluated in an academic robotics competition.

Ev-Trust: An Evolutionarily Stable Trust Mechanism for Decentralized LLM-Based Multi-Agent Service Economies

2026-05-31T10:13:46Z

Decentralized LLM-based multi-agent service economies face three vulnerabilities that undermine traditional trust mechanisms: reduced cost of fraud, difficulty in evaluating service quality, and instability of service content. These compounding vulnerabilities can trigger population-level trust collapse and the proliferation of short-sighted strategies. We propose Ev-Trust, an evolutionarily stable trust mechanism that addresses these vulnerabilities through three targeted designs: a cross-validation gate leveraging requestor semantic comprehension to assess response validity, a variance-standardized drift measure filtering endogenous stochasticity from genuine behavioral anomalies, and an embedding of trust signals into the expected revenue function that converts trustworthiness into an evolutionary survival advantage. Based on replicator dynamics with a noisy best response micro-foundation, we prove the asymptotic stability of cooperative evolutionarily stable strategies and derive explicit threshold conditions for maintaining cooperative equilibria. We evaluate Ev-Trust through 100-round simulations with at least 100 heterogeneous LLM-driven agents covering seven behavioral types. The experiments are conducted on TruthfulQA and TriviaQA, two factual question-answering benchmarks. Compared to baselines based on transitive trust aggregation, reinforcement-learning reputation, and pure evolutionary imitation, Ev-Trust reduces malicious agent participation by approximately 60%, suppresses the fraudulent service rate by approximately 50%, and maintains stable trust differentiation under a 30% adversarial mutation. These results demonstrate that coupling semantic trust evaluation with evolutionary incentives provides a principled foundation for securing cooperation in decentralized LLM-based multi-agent systems.
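
To illustrate how embedding trust in expected revenue can turn trustworthiness into a survival advantage, the following is a minimal two-strategy replicator-dynamics sketch. It is a generic textbook form, not the Ev-Trust model: it omits the noisy best response micro-foundation, and the payoff values, step size, and function names are assumptions made for illustration.

```python
def replicator_step(x, payoff_coop, payoff_fraud, dt=0.1):
    """One Euler step of dx/dt = x * (1 - x) * (payoff_coop - payoff_fraud).

    x is the population share of cooperative agents, kept within [0, 1].
    """
    growth = x * (1.0 - x) * (payoff_coop - payoff_fraud)
    return min(1.0, max(0.0, x + dt * growth))


def simulate(x0, payoff_coop, payoff_fraud, rounds=100, dt=0.1):
    """Iterate the replicator step for a fixed number of rounds."""
    x = x0
    for _ in range(rounds):
        x = replicator_step(x, payoff_coop, payoff_fraud, dt)
    return x


# Hypothetical payoffs: a trust-weighted revenue that favors cooperators,
# versus a setting where fraud pays more.
print(simulate(0.2, payoff_coop=1.5, payoff_fraud=1.0))  # cooperation spreads
print(simulate(0.8, payoff_coop=1.0, payoff_fraud=1.5))  # cooperation collapses
```

In this toy setting the sign of the payoff gap alone decides which strategy takes over, which is why a mechanism that shifts expected revenue toward trusted agents can stabilize cooperation.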

Scalable Ride-Sourcing Vehicle Rebalancing with Service Accessibility Guarantee: A Constrained Mean-Field Reinforcement Learning Approach

2026-05-31T02:42:34Z

The expansion of ride-sourcing services such as Uber and Lyft has reshaped urban transportation by offering flexible, on-demand mobility via mobile applications. Despite their convenience, these platforms confront significant operational challenges, particularly vehicle rebalancing: the strategic repositioning of a fleet of vehicles to address spatiotemporal mismatches in supply and demand. Inadequate rebalancing not only results in prolonged rider waiting times and inefficient vehicle utilization, but also leads to fairness issues, such as the inequitable distribution of service and disparities in driver income. To tackle these issues, we introduce continuous-state mean-field control (MFC) and mean-field reinforcement learning (MFRL) models with continuous repositioning actions. MFC and MFRL offer scalable solutions by modeling each vehicle's behavior through interaction with the vehicle distribution, rather than with individual vehicles. This mitigates the curse of dimensionality with respect to the number of agents, enabling coordination across large fleets with significantly reduced computational complexity and eliminating the need to retrain the model when fleet size changes. To ensure equitable service access across geographic regions, we integrate an accessibility constraint into the models and derive rebalancing policies that strike a balance between high fulfillment of rider demand and fair coverage of vehicle supply. Extensive evaluation using data-driven simulation of Shenzhen demonstrates the efficiency and robustness of our approach. Remarkably, it scales to tens of thousands of vehicles, with training times comparable to linear programming rebalancing. Moreover, our policies effectively explore the efficiency-equity Pareto front, outperforming conventional benchmarks across key metrics like fleet utilization, fulfilled requests, and pickup distance, while ensuring equitable service access.

When Parallelism Pays Off: Cohesion-Aware Task Partitioning for Multi-Agent Coding

2026-05-31T02:10:12Z

Multi-agent Large Language Model (LLM) systems offer a way to decompose complex tasks, such as coding, through parallelization and context isolation. However, adding agents in practice introduces inter-agent communication overhead, which incurs extra cost and can sometimes offset the efficiency gains. We formalize multi-agent orchestration as a graph partitioning problem that captures the communication-to-computation trade-off: task decomposition can shorten critical-path computation, but cross-agent dependencies require costly context transfer. We instantiate this view in repository-level software engineering and present Cohesion-aware Coder (Co-Coder), which builds dependency graphs from static analysis, isolates structural hub files, partitions the graph via community detection, and executes the partition with a dependency-aware scheduler. Across 28 real-world tasks on DevEval and CodeProjectEval, Co-Coder advances the Pareto frontier over sequential and file-based parallel baselines as well as Claude Code with Agent Teams, lifting pass rate by up to 14.0%, achieving up to a 2.10x wall-clock speedup, and reducing API cost by up to 35%, with the largest gains on the most dependency-dense projects. Co-Coder demonstrates how cohesion-aware orchestration can make parallel coding agents both theoretically grounded and practically efficient, suggesting a broader design principle for multi-agent systems.
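
The hub-isolation and partitioning steps can be sketched on a toy dependency graph. This is not Co-Coder's implementation: connected components stand in for community detection to keep the example dependency-free, and the degree threshold, file names, and function name are hypothetical.

```python
from collections import defaultdict


def partition_repository(edges, hub_degree=3):
    """Split files into hub files and groups that can be assigned to agents.

    edges: iterable of (file_a, file_b) dependency pairs (treated as undirected).
    hub_degree: hypothetical threshold; files with at least this many
        neighbors are isolated as hubs before partitioning.
    Connected components are used here in place of community detection.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    hubs = {f for f, adjacent in neighbors.items() if len(adjacent) >= hub_degree}

    groups, seen = [], set(hubs)
    for start in sorted(neighbors):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:  # depth-first traversal that never crosses a hub file
            node = stack.pop()
            group.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return sorted(hubs), groups


# Toy repository: utils.py is imported by every other file.
toy_edges = [
    ("utils.py", "a.py"), ("utils.py", "b.py"),
    ("utils.py", "c.py"), ("utils.py", "d.py"),
    ("a.py", "b.py"), ("c.py", "d.py"),
]
print(partition_repository(toy_edges))
```

Without isolating `utils.py`, the whole toy graph is one connected component and nothing can run in parallel; with the hub set aside, two cohesive groups emerge that need no context transfer between them.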

FinCom: A Financial Multi-Agent Demo with Disagree-or-Commit Deliberation

2026-05-31T00:53:05Z

Multi-agent systems powered by large language models (LLMs) are increasingly used for financial analysis and decision support. However, existing coordination schemes, especially those emphasizing consensus or debate, are vulnerable to sycophancy: agents conform to peer reasoning instead of evidence, leading to premature agreement and degraded outcomes. We introduce FinCom (Financial Committee), a governed multi-agent framework and interactive system that operationalizes the Disagree-or-Commit (DoC) protocol to embed structured dissent into financial AI committees. A central Supervisor orchestrates three ReAct-enabled specialist agents: Research, Quantitative, and Risk. Each agent is equipped with role-specific tools for retrieval, computation, and stress testing. During deliberation, agents must either explicitly critique or commit to their peers' reasoning before converging on a unified recommendation. This demonstration showcases how FinCom supports committee-style financial analysis through coordinated multi-agent interaction, including structured report generation and interactive decision support. Evaluated on the most recent financial agent benchmark and on 90 internal handcrafted financial tasks under an LLM-as-a-Judge protocol, DoC significantly improves reasoning accuracy and risk awareness over a consensus-seeking baseline on both the external and in-house evaluation sets. By reframing disagreement as a governance primitive rather than noise, FinCom offers a lightweight, prompt-only recipe for improving accountability, transparency, and epistemic robustness in agentic financial systems.

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

2026-05-30T17:53:10Z

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that can answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.

Proactive-reactive detection and mitigation of intermittent faults in robot swarms

2026-05-30T12:30:36Z

Intermittent faults are transient errors that sporadically appear and disappear. Although intermittent faults pose substantial challenges to reliability and coordination, existing studies of fault tolerance in robot swarms focus instead on permanent faults. One reason for this is that intermittent faults are prohibitively difficult to detect in the fully self-organized ad-hoc networks typical of robot swarms, as their network topologies are transient and often unpredictable. However, in the recently introduced self-organizing nervous systems (SoNS) approach, robot swarms are able to self-organize persistent network structures for the first time, easing the problem of detecting intermittent faults. To address intermittent faults in robot swarms that have persistent networks, we propose a novel proactive-reactive strategy for detection and mitigation, based on self-organized backup layers and distributed consensus in a multiplex network. Proactively, the robots self-organize dynamic backup paths before faults occur, adapting to changes in the primary network topology and the robots' relative positions. Reactively, robots use one-shot likelihood ratio tests to compare information received along different paths in the multiplex network, enabling early fault detection. Upon detection, communication is temporarily rerouted in a self-organized way, until the detected fault resolves. We validate the approach in representative scenarios of faulty positional data occurring during formation control, demonstrating that intermittent faults are prevented from disrupting convergence to desired formations, with high fault detection accuracy and low rates of false positives.
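
A one-shot likelihood ratio test on the discrepancy between a value received along the primary path and the same value received along a backup path might look like the sketch below. The Gaussian noise model, the assumed fault offset, the threshold, and all names are assumptions for illustration; the abstract does not specify the test statistic used in the paper.

```python
def log_likelihood_ratio(discrepancy, fault_offset, sigma):
    """Log-likelihood ratio of 'fault' versus 'no fault' for one observation.

    Assumed model (not from the paper): the discrepancy between the two paths
    is Gaussian with standard deviation sigma, with mean 0 when healthy and
    mean fault_offset when faulty. The absolute value is a simplification
    that treats positive and negative offsets alike.
    """
    d = abs(discrepancy)
    return (d * fault_offset - 0.5 * fault_offset ** 2) / sigma ** 2


def fault_detected(primary, backup, fault_offset=0.5, sigma=0.05, threshold=0.0):
    """Flag a fault when the log-likelihood ratio exceeds the threshold."""
    return log_likelihood_ratio(primary - backup, fault_offset, sigma) > threshold


# Hypothetical position readings (in meters) received along two paths.
print(fault_detected(primary=1.00, backup=1.02))  # small discrepancy
print(fault_detected(primary=1.00, backup=1.60))  # large discrepancy
```

Because the test uses a single pair of readings, it can trigger rerouting immediately; raising `threshold` trades detection speed for fewer false positives.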

Scaling Behavior of Single LLM-Driven Multi-Agent Systems

2026-05-30T09:57:49Z

The burgeoning field of LLM-based Multi-Agent Systems (MAS) promises to tackle complex tasks through collaborative intelligence, yet fundamental questions regarding their scaling behavior and intrinsic collective dynamics remain underexplored. This paper systematically investigates how the performance of a homogeneous MAS evolves as the number of agents increases, isolating the variable of collaboration from model or knowledge heterogeneity. We propose the Sequential Iterative Multi-Agent System (SIMAS) framework, a minimalist architecture centered on sequential inter-agent communication, to clearly observe scaling effects. Through extensive experiments across diverse tasks and model scales, we establish that MAS performance does not scale monotonically with agent count but follows a pattern of diminishing returns, governed by a trade-off between collaborative synergy and coordination overhead. Our findings reveal that effective MAS requires a sufficiently capable base LLM, that task type critically modulates the optimal agent count, and that collective intelligence is an emergent property contingent on strategic interaction design rather than a guaranteed outcome of agent plurality. The performance degradation stems from coordination overhead rather than merely long-context failure, and the scaling tendency generalizes across interaction architectures such as structured debate topologies. This work provides a foundational understanding of MAS scaling laws, offering practical guidance for designing efficient collaborative systems and challenging the prevailing assumption that more agents invariably lead to better performance.

MARFT: Multi-Agent Reinforcement Fine-Tuning

2026-05-30T09:48:39Z

Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multifaceted reasoning and collaboration, from high-quality presentation generation to scientific research. Meanwhile, Reinforcement Learning (RL) is widely recognized for enhancing agent intelligence, but limited work has studied fine-tuning LaMAS with foundational RL techniques. Directly applying conventional Multi-Agent Reinforcement Learning (MARL) to LaMAS also introduces major challenges due to the unique mechanisms of LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce Flex-MG, a new Markov Game formulation aligned with real-world LaMAS optimization, together with a universal algorithmic framework tailored to LaMAS. We review the evolution from traditional RL to Reinforcement Fine-Tuning (RFT), then analyze the multi-agent counterpart. For LaMAS, we identify key differences between classical MARL and MARFT, including asynchronous agent interactions, profile-aware agent design, and heterogeneous architectures. These differences motivate a LaMAS-oriented formulation of RFT. We present a robust and scalable MARFT framework, detail its modular algorithm, and provide an open-source implementation to support adoption and further research. The paper further discusses application perspectives and open challenges, including dynamic environment modeling, sample inefficiency, and the lack of cohesive frameworks. By connecting theoretical foundations with practical methodology, this work aims to serve as a roadmap for advancing MARFT toward resilient, adaptive, and human-aligned agentic systems. Implementation: https://github.com/jwliao-ai/MARFT.