https://arxiv.org/api/cFaHP2VRZmKL+qMBK0PX0VLQikM 2026-06-21T13:41:15Z 12695 585 15 http://arxiv.org/abs/2605.17293v1 Task Capability Improvement Algorithm for Collaborative Manipulators 2026-05-17T07:13:51Z This work introduces a cooperative task capability improvement algorithm that utilizes additional moments. The manipulators apply forces at the object's grasp point. Applying forces at a point other than the object's center of gravity produces undesired moments. The undesired moment acts as an additional moment. It improves the capability of an individual manipulator and, hence, the entire collaborative group. Any improvements in task capability directly add up to the object and transportation capability. The group's enhanced capability also helps achieve optimal capability, optimal resource allocation, and maximum fault tolerance in object manipulation. Our simulation results show an improvement in the capability of 5.86% compared to when no moment is used to enhance the capability of the manipulators. 2026-05-17T07:13:51Z Keshab Patra Arpita Sinha Anirban Guha http://arxiv.org/abs/2605.17292v1 MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation 2026-05-17T07:12:04Z Multi-agent large language model (LLM) systems have shown promise for solving complex tasks through agent collaboration. However, existing frameworks assign tasks based on predefined roles without considering whether an agent can accurately assess its own competence boundaries, leading to overconfident execution on tasks beyond its expertise. Inspired by metacognition theory from cognitive science, we propose MetaCogAgent, a multi-agent LLM framework where each agent is equipped with a Metacognitive Self-Assessment Unit that evaluates task-capability alignment before execution. 
The framework introduces three contributions: (1) a self-assessment mechanism that estimates per-task confidence by combining verbalized uncertainty with historical capability profiles; (2) an adaptive delegation protocol that routes low-confidence tasks to better-suited agents through cross-agent evaluation; and (3) a capability boundary learning module that iteratively refines each agent's competence model via cybernetic feedback. Experiments on our constructed MetaCog-Eval benchmark (700 tasks across 5 cognitive dimensions) demonstrate that MetaCogAgent achieves 82.4% task accuracy -- 8.7% above the best routing baseline -- while using 5% fewer API calls than AutoGen and 34% fewer than ensemble voting. Ablation studies confirm that each metacognitive component contributes to overall system performance. 2026-05-17T07:12:04Z 6 pages, submitted to IEEE SMC 2026 Chenyu Wang Yang Shu http://arxiv.org/abs/2605.12824v2 Mechanism Plausibility in Generative Agent-Based Modeling 2026-05-17T05:34:27Z Large language models (LLMs) can generate high-level diverse phenomena without explicitly programmed rules. This capability has led to their adoption within different agent-based models (ABMs) and social simulations. Recent studies investigate their ability to generate different phenomena of interest, for example, human behavior on social media platforms or alien behavior in game-theoretic scenarios. However, capability, prediction, and explanation are different--drawing from the philosophy of science and mechanisms literature, explanation requires showing, to some degree, how a phenomenon is produced by related organized entities and activities. For modelers, describing the characteristics of an experiment or whether a simulation provides progress in capability (or explanation), can be difficult without being grounded in potentially distant research areas. 
We integrate recent work on LLM-ABMs with contemporary philosophy of science literature and use it to operationalize a definition of 'plausibility' in a four-level scale. Our scale separates the evaluation of a model's generative sufficiency (ability to reproduce a phenomenon) from its mechanistic plausibility (how the phenomenon could be produced), and clarifies the distinct roles of different models, such as predictive and explanatory ones. We introduce this as the Mechanism Plausibility Scale. 2026-05-12T23:46:39Z Accepted at ACM FAccT 2026 Patrick Zhao David Huu Pham Nicholas Vincent 10.1145/3805689.3812388 http://arxiv.org/abs/2605.17206v1 Bimodal Synchronization Performance: Why Noise and Sparse Connectivity Can Improve Collective Timing 2026-05-17T00:38:49Z Pulse-coupled oscillator models inspired by firefly synchronization are widely used to study decentralized time coordination in distributed systems. We analyze a discrete-time, discrete-phase firefly-inspired synchronization model and show that collective synchrony emerges only near a critical balance between the quorum threshold (fraction of pulsing neighbors required to trigger a phase update) and the pulse duration (how long agents remain detectable to others). Within this parameter region, the system exhibits bimodal performance: it either reaches near-perfect synchronization or becomes trapped in stable multi-cluster states, where symmetrically phase-offset subgroups mutually reinforce one another and prevent global synchrony. Our analysis shows that reducing connectivity or introducing noise suppresses these low-performance states by breaking such symmetric interactions, indicating that highly connected or noiseless systems are not necessarily optimal for collective synchronization. 
2026-05-17T00:38:49Z Till Aust Tianfu Zhang Andreagiovanni Reina Heiko Hamann http://arxiv.org/abs/2605.18890v1 Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits 2026-05-17T00:21:53Z The scientific claims drawn from LLM social simulations should be no stronger than the robustness audits that support them. Generative agents bring new expressive power to agent-based modeling, enabling simulations of collective social processes like cooperation, polarization, and norm formation. Yet they also introduce complexity through additional architectural choices, such as agent specification, memory representation, interaction protocols, and environment design. Small perturbations that appear minor to researchers can cascade into macro-level outcomes through repeated interaction, creating a "butterfly effect." Consequently, scientific claims drawn from LLM social simulations may reflect implementation artifacts rather than the social mechanisms being modeled. We support this position with two case studies: a repeated Prisoner's Dilemma and a social media echo chamber simulation. Across multiple models, minor perturbations in persona format and game-instruction framing shift cooperation rates by up to 76 percentage points, while network homophily and hub assignment produce significant and consistent shifts in polarization metrics. We also find that sensitivity is unevenly distributed across both architectural choices and model families: the same perturbation that produces the 76 pp shift in one frontier model only shifts another by 1 pp. Robustness is therefore a property that should be measured per claim and per model, not assumed. To address this validation gap, we introduce TRAILS (Taxonomy for Robustness Audits In LLM Simulations), a robustness-audit taxonomy spanning three levels of simulation design: agent (micro-level), interaction (meso-level), and system (macro-level). 
We call for robustness to become a first-order validation requirement before LLM social simulations are used to explain mechanisms, evaluate interventions, or inform decisions. 2026-05-17T00:21:53Z Jinyi Ye Lei Cao Ding Chen Emilio Ferrara http://arxiv.org/abs/2605.17193v1 Multi-LLM Systems Exhibit Robust Semantic Collapse 2026-05-16T23:29:42Z Whether machines can originate novel content has been debated for nearly two centuries, from Lovelace's assertion that no engine can "originate anything" to Turing's question of whether a machine can amplify ideas brought in from outside. Multi-large language model (LLM) systems, increasingly deployed for autonomous generation, reopen this question empirically. Here we show that such systems, operating in closed loops, exhibit semantic collapse: systematic convergence in semantic representations despite apparent lexical variation. Across model families and extended simulations of 200 to 1,000 rounds, the pattern remains consistent. Twelve intervention strategies, spanning decoding parameters, prompt design, agent composition, activation engineering, and reinforcement learning, fail to restore semantic diversity. Mechanistic analyses suggest that semantic collapse is not explained by alignment or conformity biases, but is consistent with intrinsic properties of autoregressive generation. Our results point to fundamental constraints in the ability of multi-LLM systems to sustain open-ended knowledge production in closed-loop settings. 2026-05-16T23:29:42Z 64 pages, 8 figures, 7 tables; includes Supplementary Information Weiyi Kong Shiyang Lai Jinghua Piao James Evans http://arxiv.org/abs/2502.16691v2 Responsible Federated LLMs via Safety Filtering and Constitutional AI 2026-05-16T22:13:25Z Recent research has increasingly focused on training large language models (LLMs) using federated learning, known as FedLLM. 
However, responsible AI (RAI), which aims to ensure safe and trustworthy responses, remains underexplored in this context. In FedLLM, client-side training data may contain harmful content, resulting in unsafe LLMs that can generate inappropriate responses. Aggregating such models into a global model and redistributing it to clients risks the widespread deployment of unsafe LLMs. To address this, we incorporate two well-established RAI techniques into FedLLM: safety filtering and constitutional AI. Our experiments show that these methods significantly improve LLM safety, achieving over 20% improvement on AdvBench. 2025-02-23T19:12:10Z Accepted at the 6th Workshop on Trustworthy NLP (TrustNLP), ACL 2026 Eunchung Noh Jeonghun Baek http://arxiv.org/abs/2605.17169v1 Responsible Agentic AI Requires Explicit Provenance 2026-05-16T21:56:33Z Agentic AI is rapidly proliferating across diverse real-world domains such as software engineering, yet public trust has not kept pace. The central reason is that responsibility, despite being widely discussed, remains a subjective and unenforced concept, as no current agentic framework produces the quantifiable, traceable, and intervenable provenance needed to assign it when harm emerges from compositions no single party designed. We argue that what is missing is not better benchmark-level evaluation but $\textbf{explicit provenance}$ across the full agentic lifecycle, which is the only viable basis for making responsibility computable and actionable. 
We advance this agenda along four axes: establishing $\textit{why}$ such provenance is a structural necessity by identifying responsibility gaps across sociotechnical dimensions, formalizing $\textit{what}$ it must encode through a causal attribution function and responsibility tensor, discussing $\textit{how}$ it can be made computable across four lifecycle layers with preliminary experiments showing that provenance is estimable and intervenable online before irreversible harm accumulates, and examining $\textit{who}$ bears responsibility through a concrete agentic incident. Explicit provenance is not a discretionary refinement but the necessary condition for responsible agentic AI, and no stakeholder across its ecosystem can afford to treat it as optional. 2026-05-16T21:56:33Z Under Review Jinwei Hu Xinmiao Huang Qisong He Youcheng Sun Yi Dong Xiaowei Huang http://arxiv.org/abs/2605.17159v1 MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop 2026-05-16T21:18:39Z Document processing automation remains a critical challenge in enterprise environments, where traditional manual approaches are labor-intensive and error-prone. We present MADP, a multi-agent architecture that addresses the challenge of automating document processing in enterprise settings by combining deep learning-based classification and parsing with large language model extraction, while maintaining accuracy through selective human validation. Our system integrates five specialized agents--Classificator, Splitter, Parser, Extraction, and Validator--with a Human-in-the-Loop (HITL) mechanism and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. The operational analysis on a production use-case scenario of 100,000 invoices per year indicates a potential reduction of Full-Time Equivalent (FTE) requirements by approximately 70%. 
Production deployment on 955 real-world documents processed through January 2026 achieves a 97.0% full-pipeline automation rate, with only 3% requiring non-AI fallback. Ablation evaluation on a stratified 100-document subset (5 documents for each of 20 supplier/document-type categories) demonstrates that the full MADP configuration with Human-in-the-Loop supervision attains 98.5% document-level accuracy. Additionally, we present a comprehensive sustainability analysis showing that our hybrid AI+HITL approach reduces CO2 emissions by 69%, energy consumption by 69%, and water usage by 63% compared to traditional manual processing. Benchmark comparisons of multiple LLM backends (Granite-Docling, Mistral-Small, DeepSeek-OCR) provide practical insights for deployment in production environments. 2026-05-16T21:18:39Z 18 pages, 5 figures Diego Gosmar Giovanni Zenezini http://arxiv.org/abs/2605.17065v1 PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning 2026-05-16T16:15:59Z Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in real-world applications. Compared with unimodal settings, multimodal memory introduces additional challenges, including heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities. We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related events with strong causal connectivity but low semantic similarity while reducing noise. 
Experiments on multiple long-video understanding benchmarks show that PyraVid consistently improves performance across datasets, model scales, and question types, highlighting the effectiveness of hierarchical multimodal memory for long-horizon reasoning. 2026-05-16T16:15:59Z Sikuan Yan Sicheng Dong Haotong Wang Ercong Nie Yilun Liu Jinhe Bi Yingjie Xu Susanna Schwarzmann Riccardo Trivisonno Volker Tresp Yunpu Ma http://arxiv.org/abs/2605.23986v1 MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing 2026-05-16T13:11:47Z Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-update lifecycle. Despite substantial prior work, existing systems suffer from significant maintenance overhead due to two key limitations: coarse-grained state management and inherently sequential update pipelines. In particular, updates are often tightly coupled with LLM inference and require full-state rewrites, leading to poor scalability and growing latency as memory accumulates. To address these challenges, we present MemForest, a memory framework that reformulates agent memory as a write-efficient temporal data management problem. MemForest breaks the sequential bottleneck via parallel chunk extraction, decoupling memory construction into concurrent, independent operations. To further eliminate coarse-grained maintenance, we introduce MemTree, a hierarchical temporal index that organizes memory as time-ordered trees rather than flat global summaries. This design replaces full-state rewrites with localized per-node updates, reducing maintenance cost to the affected tree paths while naturally preserving temporally evolving states. We evaluate MemForest on two long-context memory benchmarks, LongMemEval-S and LoCoMo. 
On LongMemEval-S, MemForest achieves the best overall performance among stateful baselines, reaching 79.8% pass@1 accuracy while sustaining a memory construction throughput approximately 6x higher than state-of-the-art approaches including EverMemOS. 2026-05-16T13:11:47Z 12 pages. Extended version with appendix as supplemental material. Submitted to VLDB Han Chen Zining Zhang Wenqi Pei Bingsheng He Ming Wu Jason Zeng Michael Heinrich Wei Wu Hongbao Zhang http://arxiv.org/abs/2605.16855v1 Lifelong LaCAM with Local Guidance for Lifelong MAPF 2026-05-16T07:39:21Z Local guidance has recently proven to be a powerful driver of empirical performance in real-time, suboptimal multi-agent pathfinding (MAPF), improving the scalable configuration-based solver LaCAM. By injecting informative spatiotemporal cues around each agent, local guidance mitigates congestion, reduces waiting, and remains scalable enough even with tight time budgets, yielding state-of-the-art performance for one-shot MAPF. This study asks whether the same benefits can be lifted to the lifelong setting (LMAPF), where tasks arrive continuously and improvements in per-step plans can increase task completion throughput over long horizons. We propose LLLG, a Lifelong version of LaCAM enhanced with Local Guidance, which employs a receding-horizon windowed planning framework and warm-starts guidance from the previous solution at each timestep. Our method scales effectively, maintains high throughput even in compact, dense environments, and surpasses existing planners, thereby pushing the frontier of real-time, lifelong MAPF. 
2026-05-16T07:39:21Z 10 pages, 11 figures, accepted to SoCS 2026 Tomoki Arita Keisuke Okumura http://arxiv.org/abs/2605.09395v2 Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning 2026-05-16T07:36:34Z In this paper, we propose the first VL$\underline{\textbf{M}}$ $\underline{\textbf{a}}$gentic $\underline{\textbf{r}}$easoning framework for few-$\underline{\textbf{s}}$hot multimodal $\underline{\textbf{T}}$ime $\underline{\textbf{S}}$eries $\underline{\textbf{C}}$lassification ($\textbf{MarsTSC}$), which introduces a self-evolving knowledge bank as a dynamic context iteratively refined via reflective agentic reasoning. The framework comprises three collaborative roles: i) Generator conducts reliable classification via reasoning; ii) Reflector diagnoses the root causes of reasoning errors to yield discriminative insights targeting the temporal features overlooked by Generator; iii) Modifier applies verified updates to the knowledge bank to prevent context collapse. We further introduce a test-time update strategy to enable cautious, continuous knowledge bank refinement to mitigate few-shot bias and distribution shift. Extensive experiments across 12 mainstream time series benchmarks demonstrate that $\textbf{MarsTSC}$ delivers substantial and consistent performance gains across 6 VLM backbones, outperforming both classical and foundation model-based time series baselines under few-shot conditions, while producing interpretable rationales that ground each classification decision in human-readable feature evidence. 2026-05-10T07:47:09Z 18 pages, 12 figures, 6 tables. 
Preprint Lin Li Jiawei Huang Qihao Quan Dan Li Boxin Li Xiao Zhang Erli Meng Wenjie Feng Jian Lou See-Kiong Ng http://arxiv.org/abs/2603.00876v2 BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning 2026-05-16T06:59:19Z Large language models (LLMs) have demonstrated significant reasoning capabilities in scientific discovery but struggle to bridge the gap to physical execution in wet-labs. In these irreversible environments, probabilistic hallucinations are not merely incorrect; they can cause equipment damage or experimental failure. We propose BioProAgent, a neuro-symbolic framework that anchors probabilistic planning in a deterministic Finite State Machine (FSM). We introduce a State-Augmented Planning mechanism that enforces a rigorous Design-Verify-Rectify workflow, ensuring hardware compliance before execution. Furthermore, we address the context bottleneck inherent in complex device schemas by Semantic Symbol Grounding, reducing token consumption by ~6x through symbolic abstraction. In the extended BioProBench benchmark, BioProAgent achieves 95.6% physical compliance (compared to 21.0% for ReAct), demonstrating that neuro-symbolic constraints are essential for reliable autonomy in irreversible physical environments. Code: https://github.com/YuyangSunshine/bioproagent | Website: https://yuyangsunshine.github.io/BioPro-Project. 2026-03-01T02:36:01Z Yuyang Liu Jingya Wang Liuzhenghao Lv Yonghong Tian http://arxiv.org/abs/2605.09341v2 SkillMAS: Skill Co-Evolution with LLM-based Multi-Agent System 2026-05-16T05:37:45Z Large language model (LLM) agent systems are increasingly expected to improve after deployment, but existing work often decouples two adaptation targets: skill evolution and multi-agent system (MAS) restructuring. This separation can create organization bottlenecks, context pressure, and mis-specialization. 
We present SkillMAS, a non-parametric framework for adaptive specialization in multi-agent systems that couples skill evolution with MAS restructuring. SkillMAS uses Utility Learning to assign credit from verified execution traces, bounded skill evolution to refine reusable procedures without unfiltered library growth, and evidence-gated MAS restructuring when retained failures and Executor Utility indicate a structural mismatch. Across embodied manipulation, command-line execution, and retail workflows, SkillMAS is competitive under the reported harnesses while clarifying how post-deployment specialization is attributed, updated, and applied. 2026-05-10T05:43:12Z 21 pages, 2 figures Shuai Pan Yixiang Liu Jiaye Gao Te Gao Weiwen Liu Jianghao Lin Zhihui Fu Jun Wang Weinan Zhang Yong Yu