https://arxiv.org/api/29MGT0J44KKiHEv9CB39K76eNmc 2026-06-13T18:00:35Z 54141 105 15 http://arxiv.org/abs/2606.12142v1 AerialClaw: An Open-Source Framework for LLM-Driven Autonomous Aerial Agents 2026-06-10T14:34:24Z

Unmanned aerial vehicles (UAVs) are increasingly used in inspection, search and rescue, environmental monitoring, and emergency response. However, most UAV applications still rely on pre-defined command sequences or task-specific pipelines, where developers manually connect perception, planning, flight control, simulation, logging, and safety modules. This limits the flexibility, reproducibility, and extensibility of autonomous aerial systems. This paper presents AerialClaw, an open-source software framework that enables UAVs to operate as decision-making aerial agents rather than merely command-following platforms. Given a natural-language mission, AerialClaw allows an LLM-based agent to understand the task, maintain context, invoke executable aerial skills, observe perception and runtime feedback, and iteratively update its decisions in a closed loop. The framework adopts a modular brain-skill-runtime architecture, combining hard skills for atomic UAV operations, Markdown-based soft skills for reusable task strategies, document-driven agent state and capability boundaries, memory-driven reflection, safety-oriented runtime validation, and platform-agnostic execution adapters. AerialClaw supports lightweight mock execution, PX4 SITL with Gazebo, and AirSim-based simulation, together with a web console, pluggable model backends, example missions, simulation assets, and staged deployment scripts. By combining standardized aerial skills, document-driven agent state, memory, and closed-loop LLM decision-making, AerialClaw provides a reproducible and extensible open-source framework for building UAV systems that can interpret missions, make decisions, execute skills, and adapt their behavior from feedback.
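The closed loop described in this abstract (decide, validate, execute a skill, observe, repeat) can be illustrated with a toy sketch. Everything below is hypothetical and is not AerialClaw's API: the skill names, the `validate` rules, the `MAX_ALTITUDE` limit, and the `scripted_policy` function, which stands in for the LLM-based agent.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, float]

# Hypothetical "hard skills": atomic operations on a toy UAV state.
def takeoff(state: State, altitude: float) -> State:
    return {**state, "altitude": altitude}

def fly_forward(state: State, distance: float) -> State:
    return {**state, "x": state["x"] + distance}

def land(state: State, _unused: float = 0.0) -> State:
    return {**state, "altitude": 0.0}

HARD_SKILLS: Dict[str, Callable[[State, float], State]] = {
    "takeoff": takeoff,
    "fly_forward": fly_forward,
    "land": land,
}
MAX_ALTITUDE = 30.0  # assumed safety limit, for illustration only

def validate(name: str, arg: float, state: State) -> Optional[str]:
    """Runtime check run before a skill executes; returns an error or None."""
    if name not in HARD_SKILLS:
        return f"unknown skill: {name}"
    if name == "takeoff" and not 0.0 < arg <= MAX_ALTITUDE:
        return "altitude out of bounds"
    if name == "fly_forward" and state["altitude"] <= 0.0:
        return "cannot fly forward while on the ground"
    return None

def scripted_policy(state: State, history: List[Tuple[str, str]]):
    """Stand-in for the LLM agent: picks the next skill from the observed state."""
    if state["altitude"] == 0.0 and state["x"] == 0.0:
        return ("takeoff", 10.0)
    if state["x"] < 20.0:
        return ("fly_forward", 10.0)
    if state["altitude"] > 0.0:
        return ("land", 0.0)
    return None  # mission complete

def run_mission(max_steps: int = 10):
    state: State = {"x": 0.0, "altitude": 0.0}
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        decision = scripted_policy(state, history)
        if decision is None:
            break
        name, arg = decision
        error = validate(name, arg, state)
        if error is None:
            state = HARD_SKILLS[name](state, arg)
        # The outcome is fed back so the next decision can take it into account.
        history.append((name, "ok" if error is None else error))
    return state, history
```

In this sketch a rejected skill leaves the state unchanged and is recorded in `history`, which is one simple way a policy could observe runtime feedback.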

2026-06-10T14:34:24Z Ke Li Jianfei Yang Luyao Zhang Guo Yu Chengwei Yan Yuan Ding Di Wang Nan Luo Gang Liu Xiao Gao Quan Wang http://arxiv.org/abs/2606.12112v1 PEBRE: An Open-Hardware Compute and Perception Add-On for the Pepper Robot 2026-06-10T14:06:30Z

This paper presents the design, development, and experimental verification of PEBRE, an open-hardware add-on for fast software development on the Pepper Robot. Our project enhances Pepper's computational and perception capabilities by integrating external components such as a Jetson Orin Nano, Logitech BRIO, Intel RealSense D435i, Samson UB1, and RØDE VideoMicro II. Our results show that the new hardware considerably improved Pepper's perception abilities and computational power. This development contributes to the community by implementing an open hardware and open-source modular add-on to the Pepper robot and keeping this relevant research platform functional beyond its expected lifespan. With PEBRE, we aim to facilitate faster software development and more efficient integration of external components, ultimately enhancing the capabilities of the Pepper robot.

2026-06-10T14:06:30Z Malte Kuhlmann Ignacio Bugueno-Cordova Emil Alms Javier Ruiz-del-Solar Nicolás Navarro-Guerrero http://arxiv.org/abs/2606.12109v1 Bridging the Morphology Gap: Adapting VLA Models to Dexterous Manipulation via Intent-Conditioned Fine-Tuning 2026-06-10T14:03:52Z

Vision-Language-Action (VLA) models have demonstrated remarkable zero-shot generalization in robotic manipulation, yet the vast majority of pre-trained pipelines remain strictly confined to low-DoF parallel grippers. Adapting these rich semantic priors to high-DoF dexterous hands introduces a severe morphology gap: direct end-to-end joint fine-tuning inherently causes catastrophic forgetting of spatial reasoning and acute action manifold collapse due to data scarcity. In this paper, we present InDex, a novel, data-efficient adaptation framework rooted in cross-morphology semantic inheritance. Rather than discarding the pre-trained 1-DoF parallel grasp output, we repurpose it as a continuous, macroscopic virtual grasp intent proxy to sequentialize the control topology. We implement a two-stage decoupled learning architecture: the first stage parameter-efficiently aligns the VLA backbone to predict continuous arm trajectories and the scalar grasp intent; the second stage freezes this spatial backbone and leverages an intent-conditioned denoising diffusion head to decode fine-grained joint articulations for multi-fingered end-effectors. Extensive simulation benchmarks across a suite of multi-stage, contact-rich dexterous manipulation tasks demonstrate that InDex effectively masters intricate skills with minimal demonstration data, substantially outperforming monolithic baselines while preserving the robust spatial generalizability of the original VLA prior.

2026-06-10T14:03:52Z Chuanke Pang Junyi Huang Zhijun Zhao Yaobing Wang Kun Xu Xilun Ding http://arxiv.org/abs/2606.12105v1 DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model 2026-06-10T13:59:07Z

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: https://intuitive-robots.github.io/DAM-VLA/
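The idea of per-modality buffers refreshed at their own rates and read by a fixed-rate action head can be sketched as a toy scheduling loop. This is only an illustration under assumed values: the sensor periods, the scalar "latents", the `fake_sensor` function, and the summing "action head" are all hypothetical and do not reflect DAM-VLA's actual architecture.

```python
from dataclasses import dataclass

@dataclass
class ModalityBuffer:
    """Holds the most recent latent for one modality."""
    period_ms: int       # refresh period of this modality (assumed values)
    latent: float = 0.0  # a scalar stands in for a latent vector
    updates: int = 0

    def maybe_refresh(self, t_ms: int, read_sensor) -> None:
        # Refresh only on this modality's own clock; otherwise keep the old latent.
        if t_ms % self.period_ms == 0:
            self.latent = read_sensor(t_ms)
            self.updates += 1

def fake_sensor(t_ms: int) -> float:
    """Hypothetical sensor encoder: returns the timestamp as the 'latent'."""
    return float(t_ms)

def run(duration_ms: int = 100, control_period_ms: int = 10):
    buffers = {
        "force": ModalityBuffer(period_ms=2),         # 500 Hz (assumed)
        "vision": ModalityBuffer(period_ms=50),       # 20 Hz (assumed)
        "language": ModalityBuffer(period_ms=10**9),  # set once per episode
    }
    actions = []
    for t_ms in range(duration_ms):
        for buf in buffers.values():
            buf.maybe_refresh(t_ms, fake_sensor)
        # The action head runs at 100 Hz and reads whatever each buffer holds.
        if t_ms % control_period_ms == 0:
            actions.append(sum(buf.latent for buf in buffers.values()))
    return buffers, actions
```

Over 100 ms this loop refreshes the force buffer 50 times, the vision buffer twice, and the language buffer once, while still producing 10 actions, so the control rate is not tied to the slowest modality.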

2026-06-10T13:59:07Z 17 pages, 8 figures Pankhuri Vanjani Zhuoyue Li Jakub Suliga Moritz Reuss Gianluca Geraci Xinkai Jiang Rudolf Lioutikov http://arxiv.org/abs/2606.12499v1 Action-Effect Memory Pretraining for Robot Manipulation 2026-06-10T13:58:14Z

We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.

2026-06-10T13:58:14Z Yijing Zhou Qiwei Liang Sitong Zhuang Jiaxi Li Xianpeng Wang Boyang Cai Yunyang Mo Renjing Xu http://arxiv.org/abs/2606.08530v2 GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation 2026-06-10T13:54:28Z

Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.

2026-06-07T09:23:16Z Yuan Zhang Shiqi Zhang Yedong Shen Shuai Dong Jiajun Deng Xin Zhang Yuxuan Gao Jiajia Wu Xin Nie Zhiyuan Cheng Jianmin Ji Yanyong Zhang Xingyi Zhang Jia Pan http://arxiv.org/abs/2601.10724v4 Adaptive Sliding Mode Control for Vehicle Platoons with State-Dependent Friction Uncertainty 2026-06-10T13:46:52Z

Multi-robot formation control has various applications in domains such as vehicle troops, platoons, payload transportation, and surveillance. Maintaining formation in a vehicle platoon requires designing a suitable control scheme that can tackle external disturbances and uncertain system parameters while maintaining a predefined safe distance between the robots. A crucial challenge in this context is dealing with the unknown/uncertain friction forces between wheels and the ground, which vary with changes in road surface, tire wear, and speed of the vehicle. Although state-of-the-art adaptive controllers can handle a priori bounded uncertainties, they struggle with accurately modeling and identifying frictional forces, which are often state-dependent and cannot be a priori bounded. This thesis proposes a new adaptive sliding mode controller for wheeled mobile robot-based vehicle platoons that can handle the unknown and complex behavior of frictional forces without prior knowledge of their parameters and structures. The controller uses adaptive sliding mode control techniques to regulate the platoon's speed and maintain a predefined inter-robot distance, even in the presence of external disturbances and uncertain system parameters. This approach involves a two-stage process: first, the kinematic controller calculates the desired velocities based on the desired trajectory; and second, the dynamics model generates the commands to achieve the desired motion. By separating the kinematics and dynamics of the robot, this approach can simplify the control problem and allow for more efficient and robust control of the wheeled mobile robot.

2025-12-23T13:39:29Z Rishabh Dev Yadav http://arxiv.org/abs/2606.12070v1 Fibration Trees: A Unified Approach to Multi-Robot Motion Planning 2026-06-10T13:33:45Z

State space projections and decompositions have emerged as powerful tools to tackle the curse of dimensionality in high-dimensional, multi-robot motion planning problems. However, existing methods lack a unified framework which seamlessly handles combinations of projections (prioritization or task-space) and decompositions (parallel or decoupled subspaces). To fill this gap, we introduce fibration trees, which are trees consisting of state spaces as nodes and fibrations as edges, whereby a fibration models a projection from a higher-dimensional space to a lower-dimensional (or simplified) space. By modeling projections as fibrations, we unify sequential prioritization, parallel decomposition, and task-space projections under a single, coherent formalism. Building on this, we develop the rapidly-exploring random fibration trees (Fibration-RRT) planner, a sampling-based motion planner that generalizes strategies from quotient-space RRT (for sequential prioritizations) and discrete RRT (for parallel decompositions), while allowing the inclusion of task-space projections. Fibration-RRT operates on user-defined fibration trees and is proven to be probabilistically complete. To test the generality and efficiency of Fibration-RRT, we provide an open-source implementation and conduct experiments on 32 scenarios using multi-robot teams with up to 96 degrees of freedom. Our results indicate that Fibration-RRT efficiently solves high-dimensional problems by exploiting user-defined fibration trees, thereby establishing fibration trees as a powerful, unified framework for multi-robot motion planning.
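A tree with state spaces as nodes and projections as edges can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation: the `SpaceNode` class, the choice to root the tree at the full joint space, and the two-robot example are assumptions made here for clarity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Tuple[float, ...]
Projection = Callable[[State], State]

@dataclass
class SpaceNode:
    """A state space of a given dimension, with projections to simpler spaces."""
    name: str
    dim: int
    # Each edge pairs a projection with the lower-dimensional space it maps into.
    edges: List[Tuple[Projection, "SpaceNode"]] = field(default_factory=list)

    def add_projection(self, projection: Projection, child: "SpaceNode") -> "SpaceNode":
        if child.dim > self.dim:
            raise ValueError("a projection must not increase the dimension")
        self.edges.append((projection, child))
        return child

def project_all(node: SpaceNode, state: State) -> Dict[str, State]:
    """Push a state down the tree, returning its image in every space."""
    if len(state) != node.dim:
        raise ValueError(f"expected a {node.dim}-dimensional state for {node.name}")
    images = {node.name: state}
    for projection, child in node.edges:
        images.update(project_all(child, projection(state)))
    return images

# Hypothetical example: two planar point robots with joint state (x1, y1, x2, y2).
# The two edges form a parallel decomposition into one subspace per robot.
joint = SpaceNode("joint", 4)
joint.add_projection(lambda s: s[:2], SpaceNode("robot1", 2))
joint.add_projection(lambda s: s[2:], SpaceNode("robot2", 2))
```

Calling `project_all(joint, (0.0, 1.0, 2.0, 3.0))` returns the joint state together with each robot's own coordinates. A sequential prioritization could be expressed in the same structure by chaining projections along a single branch.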

2026-06-10T13:33:45Z 23 pages, 12 figures Andreas Orthey Florian T. Pokorny Lydia E. Kavraki http://arxiv.org/abs/2503.06578v2 Non-Equilibrium MAV-Capture-MAV via Time-Optimal Planning and Reinforcement Learning 2026-06-10T13:33:10Z

The capture of flying MAVs (micro aerial vehicles) has garnered increasing research attention due to its intriguing challenges and promising applications. Despite recent advancements, a key limitation of existing work is that capture strategies are often relatively simple and constrained by platform performance. This paper addresses control strategies capable of capturing high-maneuverability targets. The unique challenge of achieving target capture under unstable conditions distinguishes this task from traditional pursuit-evasion and guidance problems. In this study, we transition from larger MAV platforms to a specially designed, compact capture MAV equipped with a custom launching device while maintaining high maneuverability. We explore both time-optimal planning (TOP) and reinforcement learning (RL) methods. Simulations demonstrate that TOP offers highly maneuverable and shorter trajectories, while RL excels in real-time adaptability and stability. Moreover, the RL method has been tested in real-world scenarios, successfully achieving target capture even in unstable states.

2025-03-09T12:16:30Z Canlun Zheng Zhanyu Guo Zikang Yin Chunyu Wang Zhikun Wang Shiyu Zhao http://arxiv.org/abs/2606.12497v1 $μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models 2026-06-10T13:26:40Z

Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as $μ$VLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, $μ$VLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring a different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found at https://avanturist322.github.io/mu-vla/.

2026-06-10T13:26:40Z 34 pages, 20 figures, 9 tables Egor Cherepanov Nikita Kachaev Daniil Zelezetsky Aydar Bulatov Artem Pshenitsyn Yuri Kuratov Alexey Skrynnik Aleksandr I. Panov Alexey K. Kovalev http://arxiv.org/abs/2606.12048v1 Point Cloud Segmentation for Autonomous Clip Positioning in Laparoscopic Cholecystectomy on a Phantom 2026-06-10T13:12:45Z

High-risk applications in robotics, such as robot-assisted surgery, present unique challenges. These systems must be both highly precise and interpretable in order to be deployed in environments with very low tolerance for error or unsafe exploration. We present the first robotic system to demonstrate autonomous clip positioning on a physical phantom in laparoscopic surgery, one of the most common interventions in general surgery. After segmentation of a colorless point cloud from a single camera, target positions for the clips are extracted using spline interpolation, and can then be adjusted by the human operator. The segmentation model is trained on only 60 hand-labeled real point clouds, reflecting data scarcity in the surgical domain. We overcome this with a combination of pre-training on 128,000 synthetic point clouds and two novel data augmentation techniques. The motion of the end-effector to each target is visualized for the operator, satisfying the unique motion constraints of minimally-invasive surgery while ensuring that the robot's actions are verifiable and interpretable. In real robot experiments, our system localizes targets with the required precision of 0.75mm at a 95% success rate and executes autonomous clip positioning with a 100% success rate. We provide insights that are applicable to many other surgical and non-surgical tasks that require identifying and navigating to a precise target. Source code and project page: https://github.com/balazsgyenes/kirurc

2026-06-10T13:12:45Z 8 pages, 5 figures, accepted to IEEE Robotics and Automation Letters (RAL) IEEE Robotics and Automation Letters (Volume: 10, Issue: 8, August 2025) Balázs Gyenes Nikolai Franke Paul Maria Scheikl Pit Henrich Rayan Younis Gerhard Neumann Martin Wagner Franziska Mathis-Ullrich 10.1109/LRA.2025.3585357 http://arxiv.org/abs/2606.12042v1 KinematicRL: A Sim-to-Real Reinforcement Learning Framework For Social Navigation With Kinodynamic Feasibility 2026-06-10T13:07:03Z

Deep Reinforcement Learning (DRL) has shown promise for social navigation, yet its real-world deployment remains hindered by a persistent sim-to-real gap arising from simplified first-order dynamics and context-specific human state estimation pipelines. This work presents a unified framework that addresses these limitations to produce dynamically feasible navigation policies suitable for real-world deployment. First, theoretical analysis reveals that tracking error between simulated and actual robot position decays exponentially with increased control order, motivating the use of higher-order control inputs as the DRL action space. A second-order control formulation tailored to differential drive robots is developed, complemented by a stochastic iterative Linear Quadratic Regulator (iLQR) that pretrains the policy via a divergence minimization objective. Second, to avoid the added system complexity of camera-LiDAR fusion, a cluster-based human tracking pipeline using only 2D LiDAR is introduced. Human detections are associated according to both spatial proximity and velocity similarity, enabling reliable differentiation of nearby pedestrians and yielding stable velocity estimates through temporal aggregation. Third, we introduce an unbiased residual gating block to balance reaction- and memory-based behaviors while handling time-varying crowd sizes, both critical for social navigation. The resulting policy, KinematicRL, consistently improves kinematic performance and adapts to a varying number of detected humans. Experiments in real-world environments demonstrate that, when combined with the proposed tracking pipeline, KinematicRL can be deployed on a real differential drive robot with minimal modifications.
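To make the notion of second-order control inputs concrete, the sketch below integrates a differential-drive model whose actions are linear and angular accelerations rather than velocities. It is an assumed textbook-style formulation, not the paper's: the state layout, the semi-implicit Euler update, and the velocity limits `v_max` and `omega_max` are illustrative choices.

```python
import math
from dataclasses import dataclass

@dataclass
class DiffDriveState:
    x: float = 0.0
    y: float = 0.0
    theta: float = 0.0  # heading in radians
    v: float = 0.0      # linear velocity
    omega: float = 0.0  # angular velocity

def clamp(value: float, limit: float) -> float:
    return max(-limit, min(limit, value))

def step(state: DiffDriveState, lin_acc: float, ang_acc: float, dt: float,
         v_max: float = 1.0, omega_max: float = 2.0) -> DiffDriveState:
    """Advance one time step with acceleration commands (second-order inputs)."""
    # Accelerations change the velocities, so velocities stay continuous in time.
    v = clamp(state.v + lin_acc * dt, v_max)
    omega = clamp(state.omega + ang_acc * dt, omega_max)
    # The pose is then integrated from the updated velocities.
    theta = state.theta + omega * dt
    x = state.x + v * math.cos(theta) * dt
    y = state.y + v * math.sin(theta) * dt
    return DiffDriveState(x, y, theta, v, omega)

def rollout(lin_acc: float, ang_acc: float, steps: int, dt: float = 0.1) -> DiffDriveState:
    state = DiffDriveState()
    for _ in range(steps):
        state = step(state, lin_acc, ang_acc, dt)
    return state
```

With a first-order action space a policy could demand an instantaneous jump in velocity that a real robot cannot track. Here any command sequence yields a velocity profile bounded in both magnitude and rate of change.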

2026-06-10T13:07:03Z Accepted by IEEE Transactions on Automation Science and Engineering (T-ASE) Zhiming Xu Haodong Yang Chengju Liu Qijun Chen Chenpeng Yao http://arxiv.org/abs/2606.12028v1 VICX: Generalizable Robot Manipulation via Video Generation and In-Context Operator Network 2026-06-10T12:51:25Z

Generalizable robot manipulation requires not only task-level reasoning over unseen scenes, but also reliable grounding of visual plans into embodiment-specific execution. To bridge this gap, we propose VICX (Video generation and In-Context eXecution), a decoupled closed-loop manipulation framework. In VICX, a frozen video generation model produces vision-language-conditioned high-level visual plans, while a Video-to-Trajectory In-Context Operator Network (V2T-ICON) serves as the task-agnostic interface that grounds these plans into executable robot-state trajectories. To improve execution generalization, V2T-ICON operates on segmentation-extracted arm-only frame observations and uses retrieved image-state pairs as in-context prompts, allowing a robust and generalizable visual-to-state mapping at inference time without parameter updates. Experiments on Meta-World show that VICX supports cross-task generalization, closed-loop self-correction, and cross-embodiment transfer, demonstrating dual generalization across both task semantics and robot execution. The project webpage can be found here: https://scaling-group.github.io/vicx/.

2026-06-10T12:51:25Z The first two authors contributed equally to this work Song Chen Linyan Xiang Ying Zhou Liu Yang http://arxiv.org/abs/2606.12027v1 Learning Unions of Convex Sets via Invertible Latent Decomposition for Path Planning 2026-06-10T12:49:47Z

Collision-free path planning in cluttered, real-world environments relies on a representation of the collision-free space, and existing representations broadly fall into two categories. Explicit representations, such as unions of convex sets, can be plugged into optimization-based planners as hard collision-free constraints, but their parameters scale poorly with configuration-space dimension. Implicit representations, by contrast, are flexible and scale well to complex geometries, yet typically lack such guarantees. We bridge this gap with ILD (Invertible Latent Decomposition), a framework that jointly learns an invertible mapping and a union of explicit convex polytopes in the resulting latent space. Planning is carried out over these latent convex sets, and the invertible mapping decodes the resulting paths back to the original configuration space while preserving feasibility with respect to the refined explicit safe regions. We further propose Visibility-Guided Sampling (VGS) to keep the convex sets connected for path planning. Across 2D navigation, 6-DoF, and 14-DoF manipulation environments, ILD achieves broader coverage, better inter-set connectivity, and higher path-planning success rates than prior baselines, with zero observed false positives after test-time refinement. On a 14-DoF bimanual manipulator, we further demonstrate real-time collision-free planning, with test-time refinement adapting to scene-geometry changes during real-world deployment on a single 6-DoF arm.
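The explicit representation discussed here, a union of convex polytopes, supports simple exact membership tests. The sketch below shows this in 2D with each polytope stored as a list of half-spaces a·x ≤ b. The helper names, the `box` constructor, and the example regions are hypothetical. In ILD these sets live in a learned latent space, which this sketch does not model.

```python
from typing import List, Sequence, Tuple

HalfSpace = Tuple[Sequence[float], float]  # (a, b) meaning a . x <= b
Polytope = List[HalfSpace]

def in_polytope(point: Sequence[float], polytope: Polytope, tol: float = 1e-9) -> bool:
    """A point lies in a polytope if it satisfies every half-space constraint."""
    return all(
        sum(a_i * x_i for a_i, x_i in zip(a, point)) <= b + tol
        for a, b in polytope
    )

def containing_sets(point: Sequence[float], polytopes: List[Polytope]) -> List[int]:
    """Indices of all polytopes in the union that contain the point."""
    return [i for i, poly in enumerate(polytopes) if in_polytope(point, poly)]

def segment_in_one_set(p: Sequence[float], q: Sequence[float],
                       polytopes: List[Polytope]) -> bool:
    """Sufficient check: if one convex set holds both endpoints, it holds the segment."""
    return bool(set(containing_sets(p, polytopes)) & set(containing_sets(q, polytopes)))

def box(xmin: float, ymin: float, xmax: float, ymax: float) -> Polytope:
    """Axis-aligned rectangle written as four half-spaces (illustrative helper)."""
    return [((1, 0), xmax), ((-1, 0), -xmin), ((0, 1), ymax), ((0, -1), -ymin)]

# Two overlapping rectangles forming an L-shaped free region.
FREE_SPACE = [box(0, 0, 2, 1), box(1, 0, 2, 3)]
```

A path through the union can then be built from straight segments whose shared endpoints lie in the overlap of consecutive sets, for example (0.5, 0.5) to (1.5, 0.5) to (1.5, 2.5). This is why connectivity between sets matters for planning.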

2026-06-10T12:49:47Z Taerim Yoon Dongho Kang Kisang Park Junha Cha Stelian Coros Sungjoon Choi http://arxiv.org/abs/2606.12019v1 MPPI-based Informative Trajectory Planning for Search and Capture of Drifting Targets with ASVs 2026-06-10T12:44:25Z

Autonomous surface vehicles offer an efficient solution for environmental cleanup as well as search and rescue operations in open waters. Targets in these settings drift continuously, so efficient search must balance exploration of unobserved regions with tracking of known targets. However, most target tracking and pursuit scenarios consider simple guidance behaviours and short-term predictions for decision-making. In this letter, we address the problem of search and capture of multiple drifting targets, such as litter, in dynamic environments, using a hybrid planning framework. A key aspect of our strategy is a spatiotemporal informative planning method based on model predictive path integral (MPPI) control, a sampling-based model predictive control approach. The planner directly generates kinematic-level commands by optimising continuous trajectories over long horizons. A multi-objective cost balances search and tracking objectives while ensuring safe, feasible trajectories. In the interception stage, we switch to a pure pursuit guidance controller for the physical capture of moving targets. Experiments show that our planner outperforms the chosen planning baselines. Finally, we validate our approach in field trials with an ASV.
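For readers unfamiliar with MPPI, the sketch below shows the basic sampling-and-reweighting update on a toy problem: a 2D point vehicle with velocity commands chasing one target that drifts at a known constant velocity. It is a generic textbook-style MPPI loop under assumed parameters (sample count, noise scale, temperature `lam`, speed limit) with a pure tracking cost. It does not reproduce the paper's informative, multi-objective planner or its vehicle model.

```python
import math
import random

def mppi_weights(costs, lam):
    """Softmax weights exp(-(c - min c) / lam), normalised to sum to 1."""
    c_min = min(costs)
    raw = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(raw)
    return [r / total for r in raw]

def clamp(u, u_max):
    return max(-u_max, min(u_max, u))

def rollout_cost(pos, controls, target, drift, dt):
    """Sum of squared distances to the predicted drifting target along a rollout."""
    x, y = pos
    tx, ty = target
    cost = 0.0
    for ux, uy in controls:
        x, y = x + ux * dt, y + uy * dt
        tx, ty = tx + drift[0] * dt, ty + drift[1] * dt
        cost += (x - tx) ** 2 + (y - ty) ** 2
    return cost

def mppi_step(pos, nominal, target, drift, rng, dt=0.1,
              samples=64, sigma=0.5, lam=1.0, u_max=1.0):
    """One MPPI update: perturb the nominal plan, score rollouts, reweight the noise."""
    horizon = len(nominal)
    noise = [[(rng.gauss(0, sigma), rng.gauss(0, sigma)) for _ in range(horizon)]
             for _ in range(samples)]
    costs = []
    for eps in noise:
        controls = [(clamp(u[0] + e[0], u_max), clamp(u[1] + e[1], u_max))
                    for u, e in zip(nominal, eps)]
        costs.append(rollout_cost(pos, controls, target, drift, dt))
    w = mppi_weights(costs, lam)
    updated = []
    for t, (ux, uy) in enumerate(nominal):
        dx = sum(w[k] * noise[k][t][0] for k in range(samples))
        dy = sum(w[k] * noise[k][t][1] for k in range(samples))
        updated.append((clamp(ux + dx, u_max), clamp(uy + dy, u_max)))
    return updated

def simulate(steps=60, horizon=10, dt=0.1, seed=0):
    rng = random.Random(seed)
    pos, target, drift = (0.0, 0.0), (5.0, 0.0), (0.0, 0.1)
    nominal = [(0.0, 0.0)] * horizon
    start_dist = math.dist(pos, target)
    for _ in range(steps):
        nominal = mppi_step(pos, nominal, target, drift, rng, dt=dt)
        ux, uy = nominal[0]  # apply only the first command, then replan
        pos = (pos[0] + ux * dt, pos[1] + uy * dt)
        target = (target[0] + drift[0] * dt, target[1] + drift[1] * dt)
        nominal = nominal[1:] + [nominal[-1]]  # shift the plan one step forward
    return start_dist, math.dist(pos, target)
```

`simulate()` returns the initial and final distance to the target. Because the cost predicts the target's drift over the horizon, the sampled plans aim at where the target will be rather than where it currently is.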

2026-06-10T12:44:25Z Sanjeev Ramkumar Sudha Marija Popović Erlend M. Coates