https://arxiv.org/api/Zcg9gn9NjyxSQF2DuC8n22A2aOI 2026-06-29T07:28:02Z 12774138015

http://arxiv.org/abs/2603.26718v2
Title: Toward Evaluation Frameworks for Multi-Agent Scientific AI Systems
Updated: 2026-04-05T06:20:20Z
Abstract: We analyze the challenges of benchmarking scientific (multi)-agentic systems, including the difficulty of distinguishing reasoning from retrieval, the risks of data/model contamination, the lack of reliable ground truth for novel research problems, the complications introduced by tool use, and the replication challenges due to the continuously changing/updating knowledge base. We discuss strategies for constructing contamination-resistant problems, generating scalable families of tasks, and the need for evaluating systems through multi-turn interactions that better reflect real scientific practice. As an early feasibility test, we demonstrate how to construct a dataset of novel research ideas to test the out-of-sample performance of our system. We also discuss the results of interviews with several researchers and engineers working in quantum science. Through those interviews, we examine how scientists expect to interact with AI systems and how these expectations should shape evaluation methods.
Published: 2026-03-18T16:05:52Z
Comment: 14 pages, 4 figures
Authors: Marcin Abram

http://arxiv.org/abs/2604.03955v1
Title: Symbolic-Vector Attention Fusion for Collective Intelligence
Updated: 2026-04-05T04:10:15Z
Abstract: When autonomous agents observe different domains of a shared environment, each signal they exchange mixes relevant and irrelevant dimensions. No existing mechanism lets the receiver evaluate which dimensions to absorb. We introduce Symbolic-Vector Attention Fusion (SVAF), the content-evaluation half of a two-level coupling engine for collective intelligence. SVAF decomposes each inter-agent signal into 7 typed semantic fields, evaluates each through a learned fusion gate, and produces a remix -- new knowledge from the intersection of two domains.
A band-pass model yields four outcomes (redundant, aligned, guarded, rejected), solving both selectivity and redundancy. The fusion gate independently discovers a cross-domain relevance hierarchy: mood emerges as the highest-weight field by epoch 1, before accuracy plateaus -- consistent with independent mechanistic evidence that LLM emotion representations are structurally embedded along valence-arousal axes. SVAF forms Layer 4 of the Mesh Memory Protocol (MMP); the other half of the coupling engine is a per-agent Closed-form Continuous-time (CfC) neural network at Layer 6, whose learned per-neuron time constants (tau) create the temporal dynamics from which collective intelligence emerges: fast neurons synchronise affect across agents in seconds, while slow neurons preserve domain expertise indefinitely. SVAF determines what enters each agent's cognitive state; CfC determines how that state evolves. Trained on 237K samples from 273 narrative scenarios, SVAF achieves 78.7% three-class accuracy. We verify the complete mesh cognition loop -- from per-field evaluation through remix, CfC state evolution, tau-modulated peer blending, and autonomous action -- in a live deployment with 7 nodes across macOS, iOS, and web.
Published: 2026-04-05T04:10:15Z
Comment: 26 pages, 14 tables, 0 figures
Authors: Hongwei Xu

http://arxiv.org/abs/2604.03926v1
Title: CODE-GEN: A Human-in-the-Loop RAG-Based Agentic AI System for Multiple-Choice Question Generation
Updated: 2026-04-05T01:37:53Z
Abstract: We present CODE-GEN, a human-in-the-Loop, retrieval-augmented generation (RAG)-based agentic AI system for generating context-aligned multiple-choice questions to develop student code reasoning and comprehension abilities. CODE-GEN employs an agentic AI architecture in which a Generator agent produces multiple-choice coding comprehension questions aligned with course-specific learning objectives, while a Validator agent independently assesses content quality across seven pedagogical dimensions.
Both agents are augmented with specialized tools that enhance computational accuracy and verify code outputs. To evaluate the effectiveness of CODE-GEN, we conducted an evaluation study involving six human subject-matter experts (SMEs) who judged 288 AI-generated questions. The SMEs produced a total of 2,016 human-AI rating pairs, indicating agreement or disagreement with the assessments of Validator, along with 131 instances of qualitative feedback. Analyses of SME judgments show strong system performance, with human-validated success rates ranging from 79.9% to 98.6% across the seven pedagogical dimensions. The analysis of qualitative feedback reveals that CODE-GEN achieves high reliability on dimensions well suited to computational verification and explicit criteria matching, including question clarity, code validity, concept alignment, and correct answer validity. In contrast, human expertise remains essential for dimensions requiring deeper instructional judgment, such as designing pedagogically meaningful distractors and providing high-quality feedback that reinforces understanding. These findings inform the strategic allocation of human and AI effort in AI-assisted educational content generation.
Published: 2026-04-05T01:37:53Z
Comment: Full version of the paper accepted as a short paper at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Authors: Xiaojing Duan, Frederick Nwanganga, Chaoli Wang

http://arxiv.org/abs/2604.03888v1
Title: PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage
Updated: 2026-04-04T22:51:06Z
Abstract: This paper presents PolySwarm, a novel multi-agent large language model (LLM) framework designed for real-time prediction market trading and latency arbitrage on decentralized platforms such as Polymarket.
PolySwarm deploys a swarm of 50 diverse LLM personas that concurrently evaluate binary outcome markets, aggregating individual probability estimates through confidence-weighted Bayesian combination of swarm consensus with market-implied probabilities, and applying quarter-Kelly position sizing for risk-controlled execution. The system incorporates an information-theoretic market analysis engine using Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence to detect cross-market inefficiencies and negation pair mispricings. A latency arbitrage module exploits stale Polymarket prices by deriving CEX-implied probabilities from a log-normal pricing model and executing trades within the human reaction-time window. We provide a full architectural description, implementation details, and evaluation methodology using Brier scores, calibration analysis, and log-loss metrics benchmarked against human superforecaster performance. We further discuss open challenges including hallucination in agent pools, computational cost at scale, regulatory exposure, and feedback-loop risk, and outline five priority directions for future research. Experimental results demonstrate that swarm aggregation consistently outperforms single-model baselines in probability calibration on Polymarket prediction tasks.
Published: 2026-04-04T22:51:06Z
Comment: 13 pages, 3 figures, 3 tables
Authors: Rajat M. Barot, Arjun S. Borkhatariya

http://arxiv.org/abs/2604.03872v1
Title: Strategies in Sabotage Games: Temporal and Epistemic Perspectives
Updated: 2026-04-04T21:38:15Z
Abstract: Sabotage games are played on a dynamic graph, in which one agent, called a runner, attempts to reach a goal state, while being obstructed by a demon who at each round removes an edge from the graph. Sabotage modal logic was proposed to carry out reasoning about such games. Since its conception, it has undergone a thorough analysis (in terms of complexity, completeness, and various extensions) and has been applied to a variety of domains, e.g., to formal learning.
In this paper, we propose examining the game from a temporal perspective using alternating time temporal logic (ATL$^\ast$), and address the players' uncertainty in its epistemic extensions. This framework supports reasoning about winning strategies for those games, and opens ways to address temporal properties of dynamic graphs in general.
Published: 2026-04-04T21:38:15Z
Comment: 18 pages, 3 figures
Authors: Nina Gierasimczuk, Katrine B. P. Thoft

http://arxiv.org/abs/2311.00855v3
Title: A Multi-Agent Reinforcement Learning Framework for Public Health Decision Analysis
Updated: 2026-04-04T20:23:54Z
Abstract: Human immunodeficiency virus (HIV) is a major public health concern in the United States (U.S.), with about 1.2 million people living with it and about 35,000 newly infected each year. There are considerable geographical disparities in HIV burden and care access across the U.S. The 'Ending the HIV Epidemic (EHE)' initiative by the U.S. Department of Health and Human Services aims to reduce new infections by 90% by 2030, by improving coverage of diagnoses, treatment, and prevention interventions and prioritizing jurisdictions with high HIV prevalence. We develop intelligent decision-support systems to optimize resource allocation and intervention strategies. Existing decision analytic models either focus on individual cities or aggregate national data, failing to capture jurisdictional interactions critical for optimizing intervention strategies. To address this, we propose a multi-agent reinforcement learning (MARL) framework that enables jurisdiction-specific decision-making while accounting for cross-jurisdictional epidemiological interactions. Our framework functions as an intelligent resource optimization system, helping policymakers strategically allocate interventions based on dynamic, data-driven insights.
Experimental results across jurisdictions in California and Florida demonstrate that MARL-driven policies outperform traditional single-agent reinforcement learning approaches by reducing new infections under fixed budget constraints. Our study highlights the importance of incorporating jurisdictional dependencies in decision-making frameworks for large-scale public initiatives. By integrating multi-agent intelligent systems, decision analytics, and reinforcement learning, this study advances expert systems for government resource planning and public health management, offering a scalable framework for broader applications in healthcare policy and epidemic management.
Published: 2023-11-01T21:19:35Z
Comment: Updated to the accepted version published in Healthcare Analytics (November 2025)
Journal reference: Healthcare Analytics, 8 (2025) 100436
Authors: Dinesh Sharma, Ankit Shah, Chaitra Gopalappa
DOI: 10.1016/j.health.2025.100436

http://arxiv.org/abs/2604.03818v1
Title: Investigating the Impact of Subgraph Social Structure Preference on the Strategic Behavior of Networked Mixed-Motive Learning Agents
Updated: 2026-04-04T17:56:35Z
Abstract: Limited work has examined the strategic behaviors of relational networked learning agents under social dilemmas, and has overlooked the intricate social dynamics of complex systems. We address the challenge with Socio-Relational Intrinsic Motivation (SRIM), which endows agents with diverse preferences over sub-graphical social structures in order to study the impact of agents' personal preferences over their sub-graphical relations on their strategic decision-making under sequential social dilemmas. Our results in the Harvest and Cleanup environments demonstrate that preferences over different subgraph structures (degree-, clique-, and critical connection-based) lead to distinct variations in agents' reward gathering and strategic behavior: individual aggressiveness in Harvest and individual contribution effort in Cleanup.
Moreover, agents with different subgraphical structural positions consistently exhibit similar strategic behavioral shifts. Our proposed BCI metric captures structural variation within the population, and the relative ordering of BCI across social preferences is consistent in Harvest and Cleanup games for the same topology, suggesting the subgraphical structural impact is robust across environments. These results provide a new lens for examining agents' behavior in social dilemmas and insight for designing effective multi-agent ecosystems composed of heterogeneous social agents.
Published: 2026-04-04T17:56:35Z
Comment: 17 pages, 8 page manuscript and 9 page appendix, 10 figures
Authors: Xinqi Gao, Mario Ventresca

http://arxiv.org/abs/2604.03809v1
Title: Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
Updated: 2026-04-04T17:30:23Z
Abstract: Multi-agent LLM committees replicate the same model under different role prompts and aggregate outputs by majority vote, implicitly assuming that agents contribute complementary evidence. We embed each agent's chain-of-thought rationale and measure pairwise similarity: across 100 GSM8K questions with three Qwen2.5-14B agents, mean cosine similarity is 0.888 and effective rank is 2.17 out of 3.0, a failure mode we term representational collapse. DALC, a training-free consensus protocol that computes diversity weights from embedding geometry, reaches 87% on GSM8K versus 84% for self-consistency at 26% lower token cost. Ablation experiments reveal 1-3 point per-protocol run-to-run variance, confirm that hint sharing contributes more than diversity weighting alone, and show that encoder choice strongly modulates collapse severity (cosine 0.908 with mxbai versus 0.888 with nomic) and downstream accuracy.
The more robust finding is that collapse is measurable, worsens on harder tasks, and that the choice of embedding proxy is a first-order design decision for any latent communication protocol.
Published: 2026-04-04T17:30:23Z
Comment: 11 pages, 2 figures, 7 tables
Authors: Dipkumar Patel

http://arxiv.org/abs/2604.03796v1
Title: When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation
Updated: 2026-04-04T16:59:29Z
Abstract: When LLM-based multi-agent systems disagree, current practice treats this as noise to be resolved through consensus. We propose it can be signal. We focus on hate speech moderation, a domain where judgments depend on cultural context and individual value weightings, producing high legitimate disagreement among human annotators. We hypothesize that convergent disagreement, where agents reason similarly but conclude differently, indicates genuine value pluralism that humans also struggle to resolve. Using the Measuring Hate Speech corpus, we embed reasoning traces from five perspective-differentiated agents and classify disagreement patterns using a four-category taxonomy based on reasoning similarity and conclusion agreement. We find that raw reasoning divergence weakly predicts human annotator conflict, but the structure of agent discord carries additional signal: cases where agents agree on a verdict show markedly lower human disagreement than cases where they do not, with large effect sizes (d>0.8) surviving correction for multiple comparisons. Our taxonomy-based ordering correlates with human disagreement patterns. These preliminary findings motivate a shift from consensus-seeking to uncertainty-surfacing multi-agent design, where disagreement structure - not magnitude - guides when human judgment is needed.
Published: 2026-04-04T16:59:29Z
Comment: Accepted to the ICLR 2026 Workshop on "From Human Cognition to AI Reasoning: Models, Methods, and Applications" (HCAIR)
Authors: Michał Wawer, Jarosław A. Chudziak

http://arxiv.org/abs/2603.27771v2
Title: Emergent Social Intelligence Risks in Generative Multi-Agent Systems
Updated: 2026-04-04T07:45:49Z
Abstract: Multi-agent systems composed of large generative models are rapidly moving from laboratory prototypes to real-world deployments, where they jointly plan, negotiate, and allocate shared resources to solve complex tasks. While such systems promise unprecedented scalability and autonomy, their collective interaction also gives rise to failure modes that cannot be reduced to individual agents. Understanding these emergent risks is therefore critical. Here, we present a pioneer study of such emergent multi-agent risk in workflows that involve competition over shared resources (e.g., computing resources or market share), sequential handoff collaboration (where downstream agents see only predecessor outputs), collective decision aggregation, and others. Across these settings, we observe that such group behaviors arise frequently across repeated trials and a wide range of interaction conditions, rather than as rare or pathological cases. In particular, phenomena such as collusion-like coordination and conformity emerge with non-trivial frequency under realistic resource constraints, communication protocols, and role assignments, mirroring well-known pathologies in human societies despite no explicit instruction. Moreover, these risks cannot be prevented by existing agent-level safeguards alone.
These findings expose the dark side of intelligent multi-agent systems: a social intelligence risk where agent collectives, despite no instruction to do so, spontaneously reproduce familiar failure patterns from human societies.
Published: 2026-03-29T17:10:28Z
Authors: Yue Huang, Yu Jiang, Wenjie Wang, Haomin Zhuang, Xiaonan Luo, Yuchen Ma, Zhangchen Xu, Zichen Chen, Nuno Moniz, Zinan Lin, Pin-Yu Chen, Nitesh V Chawla, Nouha Dziri, Huan Sun, Xiangliang Zhang

http://arxiv.org/abs/2604.00449v2
Title: Convergence of Byzantine-Resilient Gradient Tracking via Probabilistic Edge Dropout
Updated: 2026-04-04T05:44:27Z
Abstract: We study distributed optimization over networks with Byzantine agents that may send arbitrary adversarial messages. We propose \emph{Gradient Tracking with Probabilistic Edge Dropout} (GT-PD), a stochastic gradient tracking method that preserves the convergence properties of gradient tracking under adversarial communication. GT-PD combines two complementary defense layers: a universal self-centered projection that clips each incoming message to a ball of radius $τ$ around the receiving agent, and a fully decentralized probabilistic dropout rule driven by a dual-metric trust score in the decision and tracking channels. This design bounds adversarial perturbations while preserving the doubly stochastic mixing structure, a property often lost under robust aggregation in decentralized settings. Under complete Byzantine isolation ($p_b=0$), GT-PD converges linearly to a neighborhood determined solely by stochastic gradient variance. For partial isolation ($p_b>0$), we introduce \emph{Gradient Tracking with Probabilistic Edge Dropout and Leaky Integration} (GT-PD-L), which uses a leaky integrator to control the accumulation of tracking errors caused by persistent perturbations and achieves linear convergence to a bounded neighborhood determined by the stochastic variance and the clipping-to-leak ratio.
We further show that under two-tier dropout with $p_h=1$, isolating Byzantine agents introduces no additional variance into the honest consensus dynamics. Experiments on MNIST under Sign Flip, ALIE, and Inner Product Manipulation attacks show that GT-PD-L outperforms coordinate-wise trimmed mean by up to 4.3 percentage points under stealth attacks.
Published: 2026-04-01T03:55:42Z
Authors: Amirhossein Dezhboro, Fateme Maleki, Arman Adibi, Erfan Amini, Jose E. Ramirez-Marquez

http://arxiv.org/abs/2603.09127v2
Title: Collective AI can amplify tiny perturbations into divergent decisions
Updated: 2026-04-03T23:37:21Z
Abstract: Large language models are increasingly deployed not as single assistants but as committees whose members deliberate and then vote or synthesize a decision. Such systems are often expected to be more robust than individual models. We show that iterative multi-LLM deliberation can instead amplify tiny perturbations into divergent conversational trajectories and different final decisions. In a fully deterministic self-hosted benchmark, exact reruns are identical, yet small meaning-preserving changes to the scenario text still separate over time and often alter the final recommendation. In deployed black-box API systems, nominally identical committee runs likewise remain unstable even at temperature 0, where many users expect near-determinism. Across 12 policy scenarios, these findings indicate that instability in collective AI is not only a consequence of residual platform-side stochasticity, but can arise from sensitivity to nearby initial conditions under repeated interaction itself. Additional deployed experiments show that committee architecture modulates this instability: role structure, model composition, and feedback memory can each alter the degree of divergence.
Collective AI therefore faces a stability problem, not only an accuracy problem: deterministic execution alone does not guarantee predictable or auditable deliberative outcomes.
Published: 2026-03-10T02:59:11Z
Comment: Main text: 9 pages, 4 figures;
Authors: Hajime Shimao, Warut Khern-am-nuai, Sung Joo Kim

http://arxiv.org/abs/2604.03056v1
Title: A Network Formation Game for Katz Centrality Maximization: A Resource Allocation Perspective
Updated: 2026-04-03T14:12:00Z
Abstract: In this paper, we study a network formation game in which agents seek to maximize their influence by allocating constrained resources to choose connections with other agents. In particular, we use Katz centrality to model agents' influence in the network. Allocations are restricted to neighbors in a given unweighted network encoding topological constraints. The allocations by an agent correspond to the weights of its outgoing edges. Such allocation by all agents thereby induces a network. This models a strategic-form game in which agents' utilities are given by their Katz centralities. We characterize the Nash equilibrium networks of this game and analyze their properties. We propose a sequential best-response dynamics (BRD) to model the network formation process. We show that it converges to the set of Nash equilibria under very mild assumptions. For complete underlying topologies, we show that Katz centralities are proportional to agents' budgets at Nash equilibria. For general underlying topologies in which each agent has a self-loop, we show that hierarchical networks form at Nash equilibria. Finally, simulations illustrate our findings.
Published: 2026-04-03T14:12:00Z
Comment: Submitted to the 65th IEEE Conference on Decision and Control (CDC), 2026. (8 pages, 5 figures)
Authors: Balaji R, Prashil Wankhede, Pavankumar Tallapragada

http://arxiv.org/abs/2212.00292v2
Title: Economics of NFTs: The Value of Creator Royalties
Updated: 2026-04-03T12:13:16Z
Abstract: Non-Fungible Tokens (NFTs) are transforming how content creators, such as artists, price and sell their work.
A key feature of NFTs is the inclusion of royalties, which grant creators a share of all future resale proceeds. Although widely used, critics argue that sophisticated speculators, who dominate NFT markets, simply price in royalties upfront, neutralizing their impact. We show this intuition holds only under perfect, frictionless markets. Under more realistic market conditions, royalties enable creators to capitalize on the presence of speculators in at least three ways: They can enable risk sharing (under risk aversion), mitigate information asymmetry (when speculators are better informed), and unlock price discrimination benefits (in multi-unit settings). Moreover, in all three cases, royalties meaningfully expand trade, implying increased transaction volume for platforms. These results offer testable predictions that can guide both empirical research and platform design.
Published: 2022-12-01T05:35:23Z
Authors: Brett Hemenway Falk, Gerry Tsoukalas, Niuniu Zhang

http://arxiv.org/abs/2604.02791v1
Title: Fully Byzantine-Resilient Distributed Multi-Agent Q-Learning
Updated: 2026-04-03T06:57:45Z
Abstract: We study Byzantine-resilient distributed multi-agent reinforcement learning (MARL), where agents must collaboratively learn optimal value functions over a compromised communication network. Existing resilient MARL approaches typically guarantee almost sure convergence only to near-optimal value functions, or require restrictive assumptions to ensure convergence to optimal solution. As a result, agents may fail to learn the optimal policies under these methods. To address this, we propose a novel distributed Q-learning algorithm, under which all agents' value functions converge almost surely to the optimal value functions despite Byzantine edge attacks. The key idea is a redundancy-based filtering mechanism that leverages two-hop neighbor information to validate incoming messages, while preserving bidirectional information flow.
We then introduce a new topological condition for the convergence of our algorithm, present a systematic method to construct such networks, and prove that this condition can be verified in polynomial time. We validate our results through simulations, showing that our method converges to the optimal solutions, whereas prior methods fail under Byzantine edge attacks.
Published: 2026-04-03T06:57:45Z
Comment: 8 pages, 3 figures, submitted to 2026 IEEE Conference on Decision and Control (CDC)
Authors: Haejoon Lee, Dimitra Panagou