https://arxiv.org/api/Ia7A03zH/xcAto0gxtkmnayarMs 2026-06-18T14:15:19Z 12677 345 15 http://arxiv.org/abs/2605.30144v1 AgentSchool: An LLM-Powered Multi-Agent Simulation for Education 2026-05-28T16:05:58Z

Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.

2026-05-28T16:05:58Z 39 pages, 10 figures Yulei Ye Wenhao Li Zhong Wen Yunshu Huang Yichen Hu Zifan Wei Yige Wang Xinyu Xie Haoxuan Yang Yanjun Huang Ruijia Li Hong Qian Yu Song Bo Jiang Bingdong Li Lijun Li Bo Zhang Pinlong Cai Xingcheng Xu Shuangye Chen Xia Hu Liang He Aimin Zhou Jingjing Qu Jing Shao Xiangfeng Wang http://arxiv.org/abs/2605.30102v1 When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems 2026-05-28T15:45:02Z

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.

2026-05-28T15:45:02Z 30 pages, 16 figures. Accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026 Corrado Rainone Davide Belli Bence Major Arash Behboodi http://arxiv.org/abs/2602.02751v2 Scaling Small Agents Through Strategy Auctions 2026-05-28T15:33:31Z

Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are sufficiently capable for agentic workflows. However, while smaller agents can closely match larger ones on simple tasks, it remains unclear how their performance scales with task complexity, when large models become necessary, and how to better leverage small agents for long-horizon workloads. In this work, we empirically show that small agents' performance fails to scale with task complexity on deep search and coding tasks, and we introduce Strategy Auctions for Workload Efficiency (SALE), an agent framework inspired by freelancer marketplaces. In SALE, agents bid with short strategic plans, which are scored by a systematic cost-value mechanism and refined via a shared auction memory, enabling per-task routing and continual self-improvement without training a separate router or running all models to completion. Across deep search and coding tasks of varying complexity, SALE reduces reliance on the largest agent by 52%, lowers overall cost by 35%, and consistently improves upon the largest agent's pass@1 with only a negligible overhead beyond executing the final trace. In contrast, established routers that rely on task descriptions either underperform the largest agent or fail to reduce cost, often both, underscoring their poor fit for agentic workflows. These results suggest that while small agents may be insufficient for complex workloads, they can be effectively "scaled up" through coordinated task allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which performance gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.

2026-02-02T20:05:51Z ICML 2026 Lisa Alazraki William F. Shen Yoram Bachrach Akhil Mathur http://arxiv.org/abs/2511.04758v2 ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling 2026-05-28T15:10:59Z

Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans, where only one arm is moving at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning & scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that's a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning & Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at https://schedulestream.github.io.

2025-11-06T19:17:42Z Project website: https://schedulestream.github.io 2026 IEEE International Conference on Robotics and Automation (ICRA) Caelan Garrett Fabio Ramos http://arxiv.org/abs/2605.30003v1 Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas 2026-05-28T14:33:10Z

We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent $\mathcal{R}$ (run as a coding agent) reads the inner-loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy-synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand-designed baselines, sharply tightens run-to-run variance, and outperforms prompt-only optimization. The discovered pipelines are objective-dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective-agnostic system prompt and from every efficiency-optimized pipeline. This supports an information-design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch-social-dilemmas.

2026-05-28T14:33:10Z Accepted to the AI Agents for Discovery in the Wild (AID-Wild) Workshop at ACM CAIS 2026 Víctor Gallego http://arxiv.org/abs/2605.29919v1 On the Geometry of Games and their Solvers 2026-05-28T13:31:49Z

A central challenge in game theory and learning systems such as GANs is understanding which algorithms can efficiently compute equilibria across the heterogeneous landscape of games. Equilibrium computation is typically studied solver by solver and game class by game class, yielding strong local guarantees but a fragmented view of solver behaviour. Existing discrete taxonomies often provide an incomplete account of where algorithms succeed. We study this problem through a solver-game map linking games to effective solver dynamics. Classical theory identifies isolated regions of this map but provides limited insight into intermediate or overlapping regimes, suggesting that solvability is governed by latent structural properties defining a continuous solver-aligned geometry of games. We formalise this perspective through structure-aware solver synthesis. A learned structure recogniser maps each game to a low-dimensional solver-aligned representation, and a policy maps this representation to effective primitive mechanisms, adapting solver behaviour across regimes. This reveals regions where particular solver dynamics are effective and where mixtures of primitives are required rather than a single dominant solver. A bounded residual acts as a local corrector and diagnostic signal for incomplete solver bases or representations. The framework yields both an adaptive solver and an analytical lens: games with similar optimisation dynamics cluster together, revealing continuous regions of algorithmic validity and overlapping solver behaviour. Empirically, we show that fixed primitives exhibit systematic regime mismatch, while the learned representation organises game space into a structured cartography aligned with solver behaviour. These results suggest viewing equilibrium computation as the joint problem of learning solver mechanisms and mapping the geometry of solvability.

2026-05-28T13:31:49Z Yaqi Sun Julian Ma David Mguni http://arxiv.org/abs/2605.30391v1 Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate 2026-05-28T12:07:03Z

Human reasoning has long been theorised to operate socially, not through isolated individual cognition, but through collective adversarial discourse, a framework known as the Argumentative Theory of Reasoning (ATR). Rather than relying on individual "intellectualist reasoners" as the primary vehicle for truth-seeking, ATR reconceptualises truth as an emergent property of social epistemology: the product of imperfect individual reasoning refined under the adversarial pressure of debate. This distributed method of collective intelligence has guided humanity to ever-greater epistemic heights and underpins the foundational principles of all democratic systems. This thesis breaks new ground by, for the first time, simulating ATR through the multi-agent debate (MAD) of large language models (LLMs). With rigorous empirical analysis, we demonstrate that, when correctly engineering an epistemically diverse set of models, LLM-MAD can significantly improve truth-seeking performance on questionnaire-based tasks, even when individual debate participants exhibit limited standalone performance. Furthermore, we present strong empirical evidence that this performance gain is mechanistically grounded in the central principles of ATR, suggesting that collective reasoning may be universally favourable over individualist reasoning, rather than a quirk in biology or evolution. Finally, drawing on our analysis of debate dynamics, we propose a novel benchmarking methodology that leverages LLM-MAD to measure intrinsic model properties (such as hallucination propensity) in order to compare models in ways that current static benchmarking approaches cannot support.

2026-05-28T12:07:03Z Master's thesis Tom Pecher http://arxiv.org/abs/2605.29790v1 Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems 2026-05-28T11:40:16Z

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate during design. This motivates experience-driven MAS evolution, where a system improves based on its own execution experience. Yet such evolution is challenging because MAS experience is prolonged and intricate, interleaving multiple agents' execution chains and communication messages, which makes it difficult to identify what should be improved. To address this challenge, we propose Meta-Team, an experience-driven MAS evolution framework based on collaborative self-evolution. Meta-Team preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta-Team conducts multi-scale self-evolution, transforming execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization. Across six long-horizon agent benchmarks, Meta-Team consistently outperforms single-agent systems, hand-crafted MAS, and prior MAS evolution methods; further analyses demonstrate that Meta-Team enables more reliable and scalable MAS self-evolution.

2026-05-28T11:40:16Z Zhezheng Hao Tianfu Wang Huanshuo Dong Ziyan Liu Hong Wang Xiankun Lin Qiang Lin Can Wang Hande Dong Jiawei Chen http://arxiv.org/abs/1605.05146v14 Understanding Social-Force Model in Psychological Principles of Collective Behavior 2026-05-28T11:29:26Z

To well understand crowd behavior, microscopic models have been developed in recent decades, in which an individual's behavioral/psychological status can be modeled and simulated. A well-known model is the social-force model innovated by physical scientists (Helbing and Molnar, 1995; Helbing, Farkas and Vicsek, 2000; Helbing et al., 2002). This model has been widely accepted and mainly used in simulation of crowd evacuation in the past decade. A problem, however, is that the testing results of the model were not explained in consistency with the psychological findings, resulting in misunderstanding of the model by psychologists. This paper will bridge the gap between psychological studies and physical explanation about this model. We reinterpret this physics-based model from a psychological perspective, clarifying that the model is consistent with psychological theories on stress, including time-related stress and interpersonal stress. Based on the conception of stress, we renew the model at both micro-and-macro level, referring to multi-agent simulation in a microscopic sense and fluid-based analysis in a macroscopic sense. The cognition and behavior of individual agents are critically modeled as response to environmental stimuli. Existing simulation results such as faster-is-slower effect will be reinterpreted by Yerkes-Dodson law, and herding and grouping effect as well as oscillation phenomenon are further discussed for pedestrian crowd. In brief the social-force model exhibits a bridge between the physics laws and psychological principles regarding crowd motion, and this paper will renew and reinterpret the model on the foundation of psychological studies.

2016-05-17T12:59:31Z 37 pages, 20 figure Peng N. Wang http://arxiv.org/abs/2605.29744v1 Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence 2026-05-28T10:42:44Z

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.

2026-05-28T10:42:44Z Accepted at ICML 2026. 12 pages main text, 16 pages appendix Yanan Wang Shuaicong Hu Jian Liu Guohui Zhou Aiguo Wang Cuiwei Yang http://arxiv.org/abs/2605.14113v2 ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows 2026-05-28T09:35:28Z

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers ``retrieval sycophancy,'' where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.

2026-05-13T20:57:37Z CVR 2026 Alvaro Lopez Pellicer Plamen Angelov Marwan Bukhari Yi Li Eduardo Soares Jemma Kerns http://arxiv.org/abs/2604.07789v2 ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents 2026-05-28T09:34:24Z

Recent advances in language model (LM) agents have significantly improved automated software engineering (SWE). Prior work has proposed various agentic workflows and training strategies as well as analyzed failure modes of agentic systems on SWE tasks, focusing on several contextual information signals: Reproduction Test, Regression Test, Edit Location, Execution Context, and API Usage. However, the individual contribution of each signal to overall success remains underexplored, particularly their ideal contribution when intermediate information is perfectly obtained. To address this gap, we introduce Oracle-SWE, a unified method to isolate and extract oracle information signals from SWE benchmarks and quantify the impact of each signal on agent performance. To further validate the pattern, we evaluate the performance gain of signals extracted by strong LMs when provided to a base agent, approximating real-world task-resolution settings. These evaluations aim to guide research prioritization for autonomous coding systems.

2026-04-09T04:37:24Z Under peer review; 37 pages, 10 figures, 5 tables Kenan Li Qirui Jin Liao Zhu Xiaosong Huang Yijia Wu Yikai Zhang Xin Zhang Zijian Jin Yufan Huang Elsie Nallipogu Chaoyun Zhang Yu Kang Saravan Rajmohan Qingwei Lin Wenke Lee Dongmei Zhang http://arxiv.org/abs/2508.03253v2 Approximate Proportionality in Online Fair Division 2026-05-28T09:25:00Z

We study the online fair division problem, where indivisible goods arrive sequentially and must be allocated immediately and irrevocably. Prior work establishes strong impossibility results for approximating classic notions such as envy-freeness up to one good (EF1) and maximin share (MMS) in this setting, but the approximability of proportionality up to one good (PROP1) has remained unresolved. We resolve this gap in two steps. First, we show that three natural greedy allocation rules (standard baselines in fair division) fail to guarantee any multiplicative approximation to PROP1 against an adaptive adversary. These limitations motivate two relaxations: (i) restricting attention to a non-adaptive adversary, and (ii) incorporating coarse predictions in the spirit of learning-augmented algorithms. Under a non-adaptive adversary, we show that the uniform random allocation achieves a meaningful PROP1 approximation with high probability, and this guarantee is essentially tight for this approach; moreover, when item values are sufficiently small, the allocation is near-PROP1 with high probability. Finally, given maximum item value (MIV) predictions, we design an online algorithm that achieves robust approximation guarantees for PROP1, and degrades gracefully under one-sided prediction error. In contrast, we show that EF1, MMS, and PROPX remain inapproximable even with perfect MIV predictions.

2025-08-05T09:31:42Z Appears in the 43rd International Conference on Machine Learning (ICML), 2026 Davin Choo Winston Fu Derek Khu Tzeh Yuan Neoh Tze-Yang Poon Nicholas Teh http://arxiv.org/abs/2605.29643v1 AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning 2026-05-28T09:09:41Z

Cross-Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLLMs) often struggle with CVR, as simple single-pass strategies encode multiple videos into a shared compressed context, potentially obscuring rare but critical evidence. In this paper, we propose AgentCVR, a multi-agent framework that treats CVR as an active evidence-acquisition task. AgentCVR employs a Master Agent to iteratively coordinate specialized Visual and Audio Agents for targeted evidence extraction. To ensure efficient training, we introduce Script-Simulated RL, which optimizes the agent's policy with LLM-generated semantic scripts and a lightweight text-based simulator, bypassing costly multimodal inference during online exploration. Experimental results on a comprehensive CVR benchmark show that AgentCVR outperforms single-pass baselines and achieves comparable performance to state-of-the-art closed-source systems, particularly in complex cross-video alignment and localization. To ensure reproducibility, our code is available at https://github.com/wang-jh24/AgentCVR.

2026-05-28T09:09:41Z Yilun Qiu Jiahe Wang Cilin Yan Jiayin Cai Xiaolong Jiang Yao Hu Chun Yuan http://arxiv.org/abs/2605.29612v1 CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems 2026-05-28T08:47:54Z

Although large language model (LLM) based multi-agent systems (MAS) show their capability to solve complex tasks and achieve higher performance over single agent systems, they lead to huge computational overheads because of heavy communication between agents. Previous research has made efforts to train a sparse multi-agent graph or fine-tune a planner to orchestrate the workflow better. However, such extra training processes introduce computational costs and limit MAS to specific domains, therefore compromising their generalizability. In this paper, we propose CONCAT, a training-free multi-agent collaboration framework based on CONsensus and Confidence-driven Ad hoc Teaming to efficiently organize agent interactions. Specifically, agents are clustered based on their initial answers, and leaders of each cluster are selected based on the agents' confidence. Then, a heuristic function based on the Theory of Mind is designed to predict the collaboration benefits between every two leaders according to their answers and confidence. Finally, an ad hoc multi-agent network is organized after evicting a percentage of communications based on the predicted benefits. Experiments across three LLMs and three benchmarks show that CONCAT achieves up to 2.02x higher efficiency (accuracy/latency ratio) than LLM-Debate and outperforms training-aware methods such as AgentDropout, while reducing average latency by 50.1% on Qwen2.5-14B-Instruct, without any task-specific training.

2026-05-28T08:47:54Z Ziyang Ma Dingyi Zhang Sichu Liang Jiajia Chu Pengfei Xia Hui Zang Deyu Zhou