https://arxiv.org/api/cFaHP2VRZmKL+qMBK0PX0VLQikM
2026-06-21T13:41:15Z
1269558515

http://arxiv.org/abs/2605.17293v1
Task Capability Improvement Algorithm for Collaborative Manipulators
2026-05-17T07:13:51Z
This work introduces a cooperative task capability improvement utilizing additional moments. The manipulators apply forces at the object's grasp point. Applying forces at a point other than the object's center of gravity produces undesired moments. The undesired moment acts as an additional moment. It improves the capability of an individual manipulator and, hence, the entire collaborative group. Any improvements in task capability directly add up to the object and transportation capability. The group's enhanced capability also helps achieve optimal capability, optimal resource allocation, and maximum fault tolerance in object manipulation. Our simulation results show an improvement in the capability of 5.86% compared to when no moment is used to enhance the capability of the manipulators.
2026-05-17T07:13:51Z
Keshab Patra, Arpita Sinha, Anirban Guha

http://arxiv.org/abs/2605.17292v1
MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation
2026-05-17T07:12:04Z
Multi-agent large language model (LLM) systems have shown promise for solving complex tasks through agent collaboration. However, existing frameworks assign tasks based on predefined roles without considering whether an agent can accurately assess its own competence boundaries, leading to overconfident execution on tasks beyond its expertise. Inspired by metacognition theory from cognitive science, we propose MetaCogAgent, a multi-agent LLM framework where each agent is equipped with a Metacognitive Self-Assessment Unit that evaluates task-capability alignment before execution.
The framework introduces three contributions: (1) a self-assessment mechanism that estimates per-task confidence by combining verbalized uncertainty with historical capability profiles; (2) an adaptive delegation protocol that routes low-confidence tasks to better-suited agents through cross-agent evaluation; and (3) a capability boundary learning module that iteratively refines each agent's competence model via cybernetic feedback. Experiments on our constructed MetaCog-Eval benchmark (700 tasks across 5 cognitive dimensions) demonstrate that MetaCogAgent achieves 82.4% task accuracy -- 8.7% above the best routing baseline -- while using 5% fewer API calls than AutoGen and 34% fewer than ensemble voting. Ablation studies confirm that each metacognitive component contributes to overall system performance.
2026-05-17T07:12:04Z
6 pages, submitted to IEEE SMC 2026
Chenyu Wang, Yang Shu

http://arxiv.org/abs/2605.12824v2
Mechanism Plausibility in Generative Agent-Based Modeling
2026-05-17T05:34:27Z
Large language models (LLMs) can generate high-level diverse phenomena without explicitly programmed rules. This capability has led to their adoption within different agent-based models (ABMs) and social simulations. Recent studies investigate their ability to generate different phenomena of interest, for example, human behavior on social media platforms or alien behavior in game-theoretic scenarios. However, capability, prediction, and explanation are different--drawing from the philosophy of science and mechanisms literature, explanation requires showing, to some degree, how a phenomenon is produced by related organized entities and activities. For modelers, describing the characteristics of an experiment or whether a simulation provides progress in capability (or explanation), can be difficult without being grounded in potentially distant research areas.
We integrate recent work on LLM-ABMs with contemporary philosophy of science literature and use it to operationalize a definition of 'plausibility' in a four-level scale. Our scale separates the evaluation of a model's generative sufficiency (ability to reproduce a phenomenon) from its mechanistic plausibility (how the phenomenon could be produced), and clarifies the distinct roles of different models, such as predictive and explanatory ones. We introduce this as the Mechanism Plausibility Scale.
2026-05-12T23:46:39Z
Accepted at ACM FAccT 2026
Patrick Zhao, David Huu Pham, Nicholas Vincent
10.1145/3805689.3812388

http://arxiv.org/abs/2605.17206v1
Bimodal Synchronization Performance: Why Noise and Sparse Connectivity Can Improve Collective Timing
2026-05-17T00:38:49Z
Pulse-coupled oscillator models inspired by firefly synchronization are widely used to study decentralized time coordination in distributed systems. We analyze a discrete-time, discrete-phase firefly-inspired synchronization model and show that collective synchrony emerges only near a critical balance between the quorum threshold (fraction of pulsing neighbors required to trigger a phase update) and the pulse duration (how long agents remain detectable to others). Within this parameter region, the system exhibits bimodal performance: it either reaches near-perfect synchronization or becomes trapped in stable multi-cluster states, where symmetrically phase-offset subgroups mutually reinforce one another and prevent global synchrony.
Our analysis shows that reducing connectivity or introducing noise suppresses these low-performance states by breaking such symmetric interactions, indicating that highly connected or noiseless systems are not necessarily optimal for collective synchronization.
2026-05-17T00:38:49Z
Till Aust, Tianfu Zhang, Andreagiovanni Reina, Heiko Hamann

http://arxiv.org/abs/2605.18890v1
Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits
2026-05-17T00:21:53Z
The scientific claims drawn from LLM social simulations should be no stronger than the robustness audits that support them. Generative agents bring new expressive power to agent-based modeling, enabling simulations of collective social processes like cooperation, polarization, and norm formation. Yet they also introduce complexity through additional architectural choices, such as agent specification, memory representation, interaction protocols, and environment design. Small perturbations that appear minor to researchers can cascade into macro-level outcomes through repeated interaction, creating a "butterfly effect." Consequently, scientific claims drawn from LLM social simulations may reflect implementation artifacts rather than the social mechanisms being modeled.
We support this position with two case studies: a repeated Prisoner's Dilemma and a social media echo chamber simulation. Across multiple models, minor perturbations in persona format and game-instruction framing shift cooperation rates by up to 76 percentage points, while network homophily and hub assignment produce significant and consistent shifts in polarization metrics. We also find that sensitivity is unevenly distributed across both architectural choices and model families: the same perturbation that produces the 76 pp shift in one frontier model only shifts another by 1 pp. Robustness is therefore a property that should be measured per claim and per model, not assumed. To address this validation gap, we introduce TRAILS (Taxonomy for Robustness Audits In LLM Simulations), a robustness-audit taxonomy spanning three levels of simulation design: agent (micro-level), interaction (meso-level), and system (macro-level). We call for robustness to become a first-order validation requirement before LLM social simulations are used to explain mechanisms, evaluate interventions, or inform decisions.
2026-05-17T00:21:53Z
Jinyi Ye, Lei Cao, Ding Chen, Emilio Ferrara

http://arxiv.org/abs/2605.17193v1
Multi-LLM Systems Exhibit Robust Semantic Collapse
2026-05-16T23:29:42Z
Whether machines can originate novel content has been debated for nearly two centuries, from Lovelace's assertion that no engine can "originate anything" to Turing's question of whether a machine can amplify ideas brought in from outside. Multi-large language model (LLM) systems, increasingly deployed for autonomous generation, reopen this question empirically. Here we show that such systems, operating in closed loops, exhibit semantic collapse: systematic convergence in semantic representations despite apparent lexical variation. Across model families and extended simulations of 200 to 1,000 rounds, the pattern remains consistent.
Twelve intervention strategies, spanning decoding parameters, prompt design, agent composition, activation engineering, and reinforcement learning, fail to restore semantic diversity. Mechanistic analyses suggest that semantic collapse is not explained by alignment or conformity biases, but is consistent with intrinsic properties of autoregressive generation. Our results point to fundamental constraints in the ability of multi-LLM systems to sustain open-ended knowledge production in closed-loop settings.
2026-05-16T23:29:42Z
64 pages, 8 figures, 7 tables; includes Supplementary Information
Weiyi Kong, Shiyang Lai, Jinghua Piao, James Evans

http://arxiv.org/abs/2502.16691v2
Responsible Federated LLMs via Safety Filtering and Constitutional AI
2026-05-16T22:13:25Z
Recent research has increasingly focused on training large language models (LLMs) using federated learning, known as FedLLM. However, responsible AI (RAI), which aims to ensure safe and trustworthy responses, remains underexplored in this context. In FedLLM, client-side training data may contain harmful content, resulting in unsafe LLMs that can generate inappropriate responses. Aggregating such models into a global model and redistributing it to clients risks the widespread deployment of unsafe LLMs. To address this, we incorporate two well-established RAI techniques into FedLLM: safety filtering and constitutional AI. Our experiments show that these methods significantly improve LLM safety, achieving over 20% improvement on AdvBench.
2025-02-23T19:12:10Z
Accepted at the 6th Workshop on Trustworthy NLP (TrustNLP), ACL 2026
Eunchung Noh, Jeonghun Baek

http://arxiv.org/abs/2605.17169v1
Responsible Agentic AI Requires Explicit Provenance
2026-05-16T21:56:33Z
Agentic AI is rapidly proliferating across diverse real-world domains such as software engineering, yet public trust has not kept pace.
The central reason is that responsibility, despite being widely discussed, remains a subjective and unenforced concept, as no current agentic framework produces the quantifiable, traceable, and intervenable provenance needed to assign it when harm emerges from compositions no single party designed. We posit that what is missing is not better benchmark-level evaluation but $\textbf{explicit provenance}$ across the full agentic lifecycle, which is the only viable basis for making responsibility computable and actionable. We advance this agenda along four axes: establishing $\textit{why}$ such provenance is a structural necessity by identifying responsibility gaps across sociotechnical dimensions, formalizing $\textit{what}$ it must encode through a causal attribution function and responsibility tensor, discussing $\textit{how}$ it can be made computable across four lifecycle layers with preliminary experiments showing that provenance is estimable and intervenable online before irreversible harm accumulates, and examining $\textit{who}$ bears responsibility through a concrete agentic incident. Explicit provenance is not a discretionary refinement but the necessary condition for responsible agentic AI, and no stakeholder across its ecosystem can afford to treat it as optional.
2026-05-16T21:56:33Z
Under Review
Jinwei Hu, Xinmiao Huang, Qisong He, Youcheng Sun, Yi Dong, Xiaowei Huang

http://arxiv.org/abs/2605.17159v1
MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop
2026-05-16T21:18:39Z
Document processing automation remains a critical challenge in enterprise environments, where traditional manual approaches are labor-intensive and error-prone. We present MADP, a multi-agent architecture that addresses the challenge of automating document processing in enterprise settings by combining deep learning-based classification and parsing with large language model extraction, while maintaining accuracy through selective human validation.
Our system integrates five specialized agents--Classificator, Splitter, Parser, Extraction, and Validator--with a Human-in-the-Loop (HITL) mechanism and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. The operational analysis on a production use-case scenario of 100,000 invoices per year indicates a potential reduction of Full-Time Equivalent (FTE) requirements by approximately 70%. Production deployment on 955 real-world documents processed through January 2026 achieves a 97.0% full-pipeline automation rate, with only 3% requiring non-AI fallback. Ablation evaluation on a stratified 100-document subset (5 documents for each of 20 supplier/document-type categories) demonstrates that the full MADP configuration with Human-in-the-Loop supervision attains 98.5% document-level accuracy. Additionally, we present a comprehensive sustainability analysis showing that our hybrid AI+HITL approach reduces CO2 emissions by 69%, energy consumption by 69%, and water usage by 63% compared to traditional manual processing. Benchmark comparisons of multiple LLM backends (Granite-Docling, Mistral-Small, DeepSeek-OCR) provide practical insights for deployment in production environments.
2026-05-16T21:18:39Z
18 pages, 5 figures
Diego Gosmar, Giovanni Zenezini

http://arxiv.org/abs/2605.17065v1
PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning
2026-05-16T16:15:59Z
Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in real-world applications. Compared with unimodal settings, multimodal memory introduces additional challenges, including heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities.
We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related events with strong causal connectivity but low semantic similarity while reducing noise. Experiments on multiple long-video understanding benchmarks show that PyraVid consistently improves performance across datasets, model scales, and question types, highlighting the effectiveness of hierarchical multimodal memory for long-horizon reasoning.
2026-05-16T16:15:59Z
Sikuan Yan, Sicheng Dong, Haotong Wang, Ercong Nie, Yilun Liu, Jinhe Bi, Yingjie Xu, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma

http://arxiv.org/abs/2605.23986v1
MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing
2026-05-16T13:11:47Z
Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-update lifecycle. Despite substantial prior work, existing systems suffer from significant maintenance overhead due to two key limitations: coarse-grained state management and inherently sequential update pipelines. In particular, updates are often tightly coupled with LLM inference and require full-state rewrites, leading to poor scalability and growing latency as memory accumulates. To address these challenges, we present MemForest, a memory framework that reformulates agent memory as a write-efficient temporal data management problem. MemForest breaks the sequential bottleneck via parallel chunk extraction, decoupling memory construction into concurrent, independent operations.
To further eliminate coarse-grained maintenance, we introduce MemTree, a hierarchical temporal index that organizes memory as time-ordered trees rather than flat global summaries. This design replaces full-state rewrites with localized per-node updates, reducing maintenance cost to the affected tree paths while naturally preserving temporally evolving states. We evaluate MemForest on two long-context memory benchmarks, LongMemEval-S and LoCoMo. On LongMemEval-S, MemForest achieves the best overall performance among stateful baselines, reaching 79.8% pass@1 accuracy while sustaining a memory construction throughput approximately 6x higher than state-of-the-art approaches including EverMemOS.
2026-05-16T13:11:47Z
12 pages. Extended version with appendix as supplemental material. Submitted to VLDB
Han Chen, Zining Zhang, Wenqi Pei, Bingsheng He, Ming Wu, Jason Zeng, Michael Heinrich, Wei Wu, Hongbao Zhang

http://arxiv.org/abs/2605.16855v1
Lifelong LaCAM with Local Guidance for Lifelong MAPF
2026-05-16T07:39:21Z
Local guidance has recently proven to be a powerful driver of empirical performance in real-time, suboptimal multi-agent pathfinding (MAPF), improving the scalable configuration-based solver LaCAM. By injecting informative spatiotemporal cues around each agent, local guidance mitigates congestion, reduces waiting, and remains scalable enough even with tight time budgets, yielding state-of-the-art performance for one-shot MAPF. This study asks whether the same benefits can be lifted to the lifelong setting (LMAPF), where tasks arrive continuously and improvements in per-step plans can increase task completion throughput over long horizons. We propose LLLG, a Lifelong version of LaCAM enhanced with Local Guidance, which employs a receding-horizon windowed planning framework and warm-starts guidance from the previous solution at each timestep.
Our method scales effectively, maintains high throughput even in compact, dense environments, and surpasses existing planners, thereby pushing the frontier of real-time, lifelong MAPF.
2026-05-16T07:39:21Z
10 pages, 11 figures, accepted to SoCS 2026
Tomoki Arita, Keisuke Okumura

http://arxiv.org/abs/2605.09395v2
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
2026-05-16T07:36:34Z
In this paper, we propose the first VL$\underline{\textbf{M}}$ $\underline{\textbf{a}}$gentic $\underline{\textbf{r}}$easoning framework for few-$\underline{\textbf{s}}$hot multimodal $\underline{\textbf{T}}$ime $\underline{\textbf{S}}$eries $\underline{\textbf{C}}$lassification ($\textbf{MarsTSC}$), which introduces a self-evolving knowledge bank as a dynamic context iteratively refined via reflective agentic reasoning. The framework comprises three collaborative roles: i) Generator conducts reliable classification via reasoning; ii) Reflector diagnoses the root causes of reasoning errors to yield discriminative insights targeting the temporal features overlooked by Generator; iii) Modifier applies verified updates to the knowledge bank to prevent context collapse. We further introduce a test-time update strategy to enable cautious, continuous knowledge bank refinement to mitigate few-shot bias and distribution shift. Extensive experiments across 12 mainstream time series benchmarks demonstrate that $\textbf{MarsTSC}$ delivers substantial and consistent performance gains across 6 VLM backbones, outperforming both classical and foundation model-based time series baselines under few-shot conditions, while producing interpretable rationales that ground each classification decision in human-readable feature evidence.
2026-05-10T07:47:09Z
18 pages, 12 figures, 6 tables.
Preprint
Lin Li, Jiawei Huang, Qihao Quan, Dan Li, Boxin Li, Xiao Zhang, Erli Meng, Wenjie Feng, Jian Lou, See-Kiong Ng

http://arxiv.org/abs/2603.00876v2
BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning
2026-05-16T06:59:19Z
Large language models (LLMs) have demonstrated significant reasoning capabilities in scientific discovery but struggle to bridge the gap to physical execution in wet-labs. In these irreversible environments, probabilistic hallucinations are not merely incorrect; they can cause equipment damage or experimental failure. We propose BioProAgent, a neuro-symbolic framework that anchors probabilistic planning in a deterministic Finite State Machine (FSM). We introduce a State-Augmented Planning mechanism that enforces a rigorous Design-Verify-Rectify workflow, ensuring hardware compliance before execution. Furthermore, we address the context bottleneck inherent in complex device schemas by Semantic Symbol Grounding, reducing token consumption by ~6x through symbolic abstraction. In the extended BioProBench benchmark, BioProAgent achieves 95.6% physical compliance (compared to 21.0% for ReAct), demonstrating that neuro-symbolic constraints are essential for reliable autonomy in irreversible physical environments. Code: https://github.com/YuyangSunshine/bioproagent | Website: https://yuyangsunshine.github.io/BioPro-Project.
2026-03-01T02:36:01Z
Yuyang Liu, Jingya Wang, Liuzhenghao Lv, Yonghong Tian

http://arxiv.org/abs/2605.09341v2
SkillMAS: Skill Co-Evolution with LLM-based Multi-Agent System
2026-05-16T05:37:45Z
Large language model (LLM) agent systems are increasingly expected to improve after deployment, but existing work often decouples two adaptation targets: skill evolution and multi-agent system (MAS) restructuring. This separation can create organization bottlenecks, context pressure, and mis-specialization.
We present SkillMAS, a non-parametric framework for adaptive specialization in multi-agent systems that couples skill evolution with MAS restructuring. SkillMAS uses Utility Learning to assign credit from verified execution traces, bounded skill evolution to refine reusable procedures without unfiltered library growth, and evidence-gated MAS restructuring when retained failures and Executor Utility indicate a structural mismatch. Across embodied manipulation, command-line execution, and retail workflows, SkillMAS is competitive under the reported harnesses while clarifying how post-deployment specialization is attributed, updated, and applied.
2026-05-10T05:43:12Z
21 pages, 2 figures
Shuai Pan, Yixiang Liu, Jiaye Gao, Te Gao, Weiwen Liu, Jianghao Lin, Zhihui Fu, Jun Wang, Weinan Zhang, Yong Yu