https://arxiv.org/api/Zcg9gn9NjyxSQF2DuC8n22A2aOI 2026-06-29T07:28:02Z 12774 1380 15 http://arxiv.org/abs/2603.26718v2 Toward Evaluation Frameworks for Multi-Agent Scientific AI Systems 2026-04-05T06:20:20Z

We analyze the challenges of benchmarking scientific (multi)-agentic systems, including the difficulty of distinguishing reasoning from retrieval, the risks of data/model contamination, the lack of reliable ground truth for novel research problems, the complications introduced by tool use, and the replication challenges due to the continuously changing/updating knowledge base. We discuss strategies for constructing contamination-resistant problems, generating scalable families of tasks, and the need for evaluating systems through multi-turn interactions that better reflect real scientific practice. As an early feasibility test, we demonstrate how to construct a dataset of novel research ideas to test the out-of-sample performance of our system. We also discuss the results of interviews with several researchers and engineers working in quantum science. Through those interviews, we examine how scientists expect to interact with AI systems and how these expectations should shape evaluation methods.

2026-03-18T16:05:52Z 14 pages, 4 figures Marcin Abram http://arxiv.org/abs/2604.03955v1 Symbolic-Vector Attention Fusion for Collective Intelligence 2026-04-05T04:10:15Z

When autonomous agents observe different domains of a shared environment, each signal they exchange mixes relevant and irrelevant dimensions. No existing mechanism lets the receiver evaluate which dimensions to absorb. We introduce Symbolic-Vector Attention Fusion (SVAF), the content-evaluation half of a two-level coupling engine for collective intelligence. SVAF decomposes each inter-agent signal into 7 typed semantic fields, evaluates each through a learned fusion gate, and produces a remix -- new knowledge from the intersection of two domains. A band-pass model yields four outcomes (redundant, aligned, guarded, rejected), solving both selectivity and redundancy. The fusion gate independently discovers a cross-domain relevance hierarchy: mood emerges as the highest-weight field by epoch 1, before accuracy plateaus -- consistent with independent mechanistic evidence that LLM emotion representations are structurally embedded along valence-arousal axes. SVAF forms Layer 4 of the Mesh Memory Protocol (MMP); the other half of the coupling engine is a per-agent Closed-form Continuous-time (CfC) neural network at Layer 6, whose learned per-neuron time constants (tau) create the temporal dynamics from which collective intelligence emerges: fast neurons synchronise affect across agents in seconds, while slow neurons preserve domain expertise indefinitely. SVAF determines what enters each agent's cognitive state; CfC determines how that state evolves. Trained on 237K samples from 273 narrative scenarios, SVAF achieves 78.7% three-class accuracy. We verify the complete mesh cognition loop -- from per-field evaluation through remix, CfC state evolution, tau-modulated peer blending, and autonomous action -- in a live deployment with 7 nodes across macOS, iOS, and web.

2026-04-05T04:10:15Z 26 pages, 14 tables, 0 figures Hongwei Xu http://arxiv.org/abs/2604.03926v1 CODE-GEN: A Human-in-the-Loop RAG-Based Agentic AI System for Multiple-Choice Question Generation 2026-04-05T01:37:53Z

We present CODE-GEN, a human-in-the-Loop, retrieval-augmented generation (RAG)-based agentic AI system for generating context-aligned multiple-choice questions to develop student code reasoning and comprehension abilities. CODE-GEN employs an agentic AI architecture in which a Generator agent produces multiple-choice coding comprehension questions aligned with course-specific learning objectives, while a Validator agent independently assesses content quality across seven pedagogical dimensions. Both agents are augmented with specialized tools that enhance computational accuracy and verify code outputs. To evaluate the effectiveness of CODE-GEN, we conducted an evaluation study involving six human subject-matter experts (SMEs) who judged 288 AI-generated questions. The SMEs produced a total of 2,016 human-AI rating pairs, indicating agreement or disagreement with the assessments of Validator, along with 131 instances of qualitative feedback. Analyses of SME judgments show strong system performance, with human-validated success rates ranging from 79.9% to 98.6% across the seven pedagogical dimensions. The analysis of qualitative feedback reveals that CODE-GEN achieves high reliability on dimensions well suited to computational verification and explicit criteria matching, including question clarity, code validity, concept alignment, and correct answer validity. In contrast, human expertise remains essential for dimensions requiring deeper instructional judgment, such as designing pedagogically meaningful distractors and providing high-quality feedback that reinforces understanding. These findings inform the strategic allocation of human and AI effort in AI-assisted educational content generation.

2026-04-05T01:37:53Z Full version of the paper accepted as a short paper at the 27th International Conference on Artificial Intelligence in Education (AIED 2026) Xiaojing Duan Frederick Nwanganga Chaoli Wang http://arxiv.org/abs/2604.03888v1 PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage 2026-04-04T22:51:06Z

This paper presents PolySwarm, a novel multi-agent large language model (LLM) framework designed for real-time prediction market trading and latency arbitrage on decentralized platforms such as Polymarket. PolySwarm deploys a swarm of 50 diverse LLM personas that concurrently evaluate binary outcome markets, aggregating individual probability estimates through confidence-weighted Bayesian combination of swarm consensus with market-implied probabilities, and applying quarter-Kelly position sizing for risk-controlled execution. The system incorporates an information-theoretic market analysis engine using Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence to detect cross-market inefficiencies and negation pair mispricings. A latency arbitrage module exploits stale Polymarket prices by deriving CEX-implied probabilities from a log-normal pricing model and executing trades within the human reaction-time window. We provide a full architectural description, implementation details, and evaluation methodology using Brier scores, calibration analysis, and log-loss metrics benchmarked against human superforecaster performance. We further discuss open challenges including hallucination in agent pools, computational cost at scale, regulatory exposure, and feedback-loop risk, and outline five priority directions for future research. Experimental results demonstrate that swarm aggregation consistently outperforms single-model baselines in probability calibration on Polymarket prediction tasks.

2026-04-04T22:51:06Z 13 pages, 3 figures, 3 tables Rajat M. Barot Arjun S. Borkhatariya http://arxiv.org/abs/2604.03872v1 Strategies in Sabotage Games: Temporal and Epistemic Perspectives 2026-04-04T21:38:15Z

Sabotage games are played on a dynamic graph, in which one agent, called a runner, attempts to reach a goal state, while being obstructed by a demon who at each round removes an edge from the graph. Sabotage modal logic was proposed to carry out reasoning about such games. Since its conception, it has undergone a thorough analysis (in terms of complexity, completeness, and various extensions) and has been applied to a variety of domains, e.g., to formal learning. In this paper, we propose examining the game from a temporal perspective using alternating time temporal logic (ATL$^\ast$), and address the players' uncertainty in its epistemic extensions. This framework supports reasoning about winning strategies for those games, and opens ways to address temporal properties of dynamic graphs in general.

2026-04-04T21:38:15Z 18 pages, 3 figures Nina Gierasimczuk Katrine B. P. Thoft http://arxiv.org/abs/2311.00855v3 A Multi-Agent Reinforcement Learning Framework for Public Health Decision Analysis 2026-04-04T20:23:54Z

Human immunodeficiency virus (HIV) is a major public health concern in the United States (U.S.), with about 1.2 million people living with it and about 35,000 newly infected each year. There are considerable geographical disparities in HIV burden and care access across the U.S. The 'Ending the HIV Epidemic (EHE)' initiative by the U.S. Department of Health and Human Services aims to reduce new infections by 90% by 2030, by improving coverage of diagnoses, treatment, and prevention interventions and prioritizing jurisdictions with high HIV prevalence. We develop intelligent decision-support systems to optimize resource allocation and intervention strategies. Existing decision analytic models either focus on individual cities or aggregate national data, failing to capture jurisdictional interactions critical for optimizing intervention strategies. To address this, we propose a multi-agent reinforcement learning (MARL) framework that enables jurisdiction-specific decision-making while accounting for cross-jurisdictional epidemiological interactions. Our framework functions as an intelligent resource optimization system, helping policymakers strategically allocate interventions based on dynamic, data-driven insights. Experimental results across jurisdictions in California and Florida demonstrate that MARL-driven policies outperform traditional single-agent reinforcement learning approaches by reducing new infections under fixed budget constraints. Our study highlights the importance of incorporating jurisdictional dependencies in decision-making frameworks for large-scale public initiatives. By integrating multi-agent intelligent systems, decision analytics, and reinforcement learning, this study advances expert systems for government resource planning and public health management, offering a scalable framework for broader applications in healthcare policy and epidemic management.

2023-11-01T21:19:35Z Updated to the accepted version published in Healthcare Analytics (November 2025) Healthcare Analytics, 8 (2025) 100436 Dinesh Sharma Ankit Shah Chaitra Gopalappa 10.1016/j.health.2025.100436 http://arxiv.org/abs/2604.03818v1 Investigating the Impact of Subgraph Social Structure Preference on the Strategic Behavior of Networked Mixed-Motive Learning Agents 2026-04-04T17:56:35Z

Limited work has examined the strategic behaviors of relational networked learning agents under social dilemmas, and has overlooked the intricate social dynamics of complex systems. We address the challenge with Socio-Relational Intrinsic Motivation (SRIM), which endows agents with diverse preferences over sub-graphical social structures in order to study the impact of agents' personal preferences over their sub-graphical relations on their strategic decision-making under sequential social dilemmas. Our results in the Harvest and Cleanup environments demonstrate that preferences over different subgraph structures (degree-, clique-, and critical connection-based) lead to distinct variations in agents' reward gathering and strategic behavior: individual aggressiveness in Harvest and individual contribution effort in Cleanup. Moreover, agents with different subgraphical structural positions consistently exhibit similar strategic behavioral shifts. Our proposed BCI metric captures structural variation within the population, and the relative ordering of BCI across social preferences is consistent in Harvest and Cleanup games for the same topology, suggesting the subgraphical structural impact is robust across environments. These results provide a new lens for examining agents' behavior in social dilemmas and insight for designing effective multi-agent ecosystems composed of heterogeneous social agents.

2026-04-04T17:56:35Z 17 pages, 8 page manuscript and 9 page appendix, 10 figures Xinqi Gao Mario Ventresca http://arxiv.org/abs/2604.03809v1 Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus 2026-04-04T17:30:23Z

Multi-agent LLM committees replicate the same model under different role prompts and aggregate outputs by majority vote, implicitly assuming that agents contribute complementary evidence. We embed each agent's chain-of-thought rationale and measure pairwise similarity: across 100 GSM8K questions with three Qwen2.5-14B agents, mean cosine similarity is 0.888 and effective rank is 2.17 out of 3.0, a failure mode we term representational collapse. DALC, a training-free consensus protocol that computes diversity weights from embedding geometry, reaches 87% on GSM8K versus 84% for self-consistency at 26% lower token cost. Ablation experiments reveal 1-3 point per-protocol run-to-run variance, confirm that hint sharing contributes more than diversity weighting alone, and show that encoder choice strongly modulates collapse severity (cosine 0.908 with mxbai versus 0.888 with nomic) and downstream accuracy. The more robust finding is that collapse is measurable, worsens on harder tasks, and that the choice of embedding proxy is a first-order design decision for any latent communication protocol.

2026-04-04T17:30:23Z 11 pages, 2 figures, 7 tables Dipkumar Patel http://arxiv.org/abs/2604.03796v1 When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation 2026-04-04T16:59:29Z

When LLM-based multi-agent systems disagree, current practice treats this as noise to be resolved through consensus. We propose it can be signal. We focus on hate speech moderation, a domain where judgments depend on cultural context and individual value weightings, producing high legitimate disagreement among human annotators. We hypothesize that convergent disagreement, where agents reason similarly but conclude differently, indicates genuine value pluralism that humans also struggle to resolve. Using the Measuring Hate Speech corpus, we embed reasoning traces from five perspective-differentiated agents and classify disagreement patterns using a four-category taxonomy based on reasoning similarity and conclusion agreement. We find that raw reasoning divergence weakly predicts human annotator conflict, but the structure of agent discord carries additional signal: cases where agents agree on a verdict show markedly lower human disagreement than cases where they do not, with large effect sizes (d>0.8) surviving correction for multiple comparisons. Our taxonomy-based ordering correlates with human disagreement patterns. These preliminary findings motivate a shift from consensus-seeking to uncertainty-surfacing multi-agent design, where disagreement structure - not magnitude - guides when human judgment is needed.

2026-04-04T16:59:29Z Accepted to the ICLR 2026 Workshop on "From Human Cognition to AI Reasoning: Models, Methods, and Applications (HCAIR) Michał Wawer Jarosław A. Chudziak http://arxiv.org/abs/2603.27771v2 Emergent Social Intelligence Risks in Generative Multi-Agent Systems 2026-04-04T07:45:49Z

Multi-agent systems composed of large generative models are rapidly moving from laboratory prototypes to real-world deployments, where they jointly plan, negotiate, and allocate shared resources to solve complex tasks. While such systems promise unprecedented scalability and autonomy, their collective interaction also gives rise to failure modes that cannot be reduced to individual agents. Understanding these emergent risks is therefore critical. Here, we present a pioneer study of such emergent multi-agent risk in workflows that involve competition over shared resources (e.g., computing resources or market share), sequential handoff collaboration (where downstream agents see only predecessor outputs), collective decision aggregation, and others. Across these settings, we observe that such group behaviors arise frequently across repeated trials and a wide range of interaction conditions, rather than as rare or pathological cases. In particular, phenomena such as collusion-like coordination and conformity emerge with non-trivial frequency under realistic resource constraints, communication protocols, and role assignments, mirroring well-known pathologies in human societies despite no explicit instruction. Moreover, these risks cannot be prevented by existing agent-level safeguards alone. These findings expose the dark side of intelligent multi-agent systems: a social intelligence risk where agent collectives, despite no instruction to do so, spontaneously reproduce familiar failure patterns from human societies.

2026-03-29T17:10:28Z Yue Huang Yu Jiang Wenjie Wang Haomin Zhuang Xiaonan Luo Yuchen Ma Zhangchen Xu Zichen Chen Nuno Moniz Zinan Lin Pin-Yu Chen Nitesh V Chawla Nouha Dziri Huan Sun Xiangliang Zhang http://arxiv.org/abs/2604.00449v2 Convergence of Byzantine-Resilient Gradient Tracking via Probabilistic Edge Dropout 2026-04-04T05:44:27Z

We study distributed optimization over networks with Byzantine agents that may send arbitrary adversarial messages. We propose \emph{Gradient Tracking with Probabilistic Edge Dropout} (GT-PD), a stochastic gradient tracking method that preserves the convergence properties of gradient tracking under adversarial communication. GT-PD combines two complementary defense layers: a universal self-centered projection that clips each incoming message to a ball of radius $τ$ around the receiving agent, and a fully decentralized probabilistic dropout rule driven by a dual-metric trust score in the decision and tracking channels. This design bounds adversarial perturbations while preserving the doubly stochastic mixing structure, a property often lost under robust aggregation in decentralized settings. Under complete Byzantine isolation ($p_b=0$), GT-PD converges linearly to a neighborhood determined solely by stochastic gradient variance. For partial isolation ($p_b>0$), we introduce \emph{Gradient Tracking with Probabilistic Edge Dropout and Leaky Integration} (GT-PD-L), which uses a leaky integrator to control the accumulation of tracking errors caused by persistent perturbations and achieves linear convergence to a bounded neighborhood determined by the stochastic variance and the clipping-to-leak ratio. We further show that under two-tier dropout with $p_h=1$, isolating Byzantine agents introduces no additional variance into the honest consensus dynamics. Experiments on MNIST under Sign Flip, ALIE, and Inner Product Manipulation attacks show that GT-PD-L outperforms coordinate-wise trimmed mean by up to 4.3 percentage points under stealth attacks.

2026-04-01T03:55:42Z Amirhossein Dezhboro Fateme Maleki Arman Adibi Erfan Amini Jose E. Ramirez-Marquez http://arxiv.org/abs/2603.09127v2 Collective AI can amplify tiny perturbations into divergent decisions 2026-04-03T23:37:21Z

Large language models are increasingly deployed not as single assistants but as committees whose members deliberate and then vote or synthesize a decision. Such systems are often expected to be more robust than individual models. We show that iterative multi-LLM deliberation can instead amplify tiny perturbations into divergent conversational trajectories and different final decisions. In a fully deterministic self-hosted benchmark, exact reruns are identical, yet small meaning-preserving changes to the scenario text still separate over time and often alter the final recommendation. In deployed black-box API systems, nominally identical committee runs likewise remain unstable even at temperature 0, where many users expect near-determinism. Across 12 policy scenarios, these findings indicate that instability in collective AI is not only a consequence of residual platform-side stochasticity, but can arise from sensitivity to nearby initial conditions under repeated interaction itself. Additional deployed experiments show that committee architecture modulates this instability: role structure, model composition, and feedback memory can each alter the degree of divergence. Collective AI therefore faces a stability problem, not only an accuracy problem: deterministic execution alone does not guarantee predictable or auditable deliberative outcomes.

2026-03-10T02:59:11Z Main text: 9 pages, 4 figures; Hajime Shimao Warut Khern-am-nuai Sung Joo Kim http://arxiv.org/abs/2604.03056v1 A Network Formation Game for Katz Centrality Maximization: A Resource Allocation Perspective 2026-04-03T14:12:00Z

In this paper, we study a network formation game in which agents seek to maximize their influence by allocating constrained resources to choose connections with other agents. In particular, we use Katz centrality to model agents' influence in the network. Allocations are restricted to neighbors in a given unweighted network encoding topological constraints. The allocations by an agent correspond to the weights of its outgoing edges. Such allocation by all agents thereby induces a network. This models a strategic-form game in which agents' utilities are given by their Katz centralities. We characterize the Nash equilibrium networks of this game and analyze their properties. We propose a sequential best-response dynamics (BRD) to model the network formation process. We show that it converges to the set of Nash equilibria under very mild assumptions. For complete underlying topologies, we show that Katz centralities are proportional to agents' budgets at Nash equilibria. For general underlying topologies in which each agent has a self-loop, we show that hierarchical networks form at Nash equilibria. Finally, simulations illustrate our findings.

2026-04-03T14:12:00Z Submitted to the 65th IEEE Conference on Decision and Control (CDC), 2026. (8 pages, 5 figures) Balaji R Prashil Wankhede Pavankumar Tallapragada http://arxiv.org/abs/2212.00292v2 Economics of NFTs: The Value of Creator Royalties 2026-04-03T12:13:16Z

Non-Fungible Tokens (NFTs) are transforming how content creators, such as artists, price and sell their work. A key feature of NFTs is the inclusion of royalties, which grant creators a share of all future resale proceeds. Although widely used, critics argue that sophisticated speculators, who dominate NFT markets, simply price in royalties upfront, neutralizing their impact. We show this intuition holds only under perfect, frictionless markets. Under more realistic market conditions, royalties enable creators to capitalize on the presence of speculators in at least three ways: They can enable risk sharing (under risk aversion), mitigate information asymmetry (when speculators are better informed), and unlock price discrimination benefits (in multi-unit settings). Moreover, in all three cases, royalties meaningfully expand trade, implying increased transaction volume for platforms. These results offer testable predictions that can guide both empirical research and platform design.

2022-12-01T05:35:23Z Brett Hemenway Falk Gerry Tsoukalas Niuniu Zhang http://arxiv.org/abs/2604.02791v1 Fully Byzantine-Resilient Distributed Multi-Agent Q-Learning 2026-04-03T06:57:45Z

We study Byzantine-resilient distributed multi-agent reinforcement learning (MARL), where agents must collaboratively learn optimal value functions over a compromised communication network. Existing resilient MARL approaches typically guarantee almost sure convergence only to near-optimal value functions, or require restrictive assumptions to ensure convergence to optimal solution. As a result, agents may fail to learn the optimal policies under these methods. To address this, we propose a novel distributed Q-learning algorithm, under which all agents' value functions converge almost surely to the optimal value functions despite Byzantine edge attacks. The key idea is a redundancy-based filtering mechanism that leverages two-hop neighbor information to validate incoming messages, while preserving bidirectional information flow. We then introduce a new topological condition for the convergence of our algorithm, present a systematic method to construct such networks, and prove that this condition can be verified in polynomial time. We validate our results through simulations, showing that our method converges to the optimal solutions, whereas prior methods fail under Byzantine edge attacks.

2026-04-03T06:57:45Z 8 pages, 3 figures, submitted to 2026 IEEE Conference on Decision and Control (CDC) Haejoon Lee Dimitra Panagou