https://arxiv.org/api/c6CcFA4Gb50K6PIvHAaimDoATuw
Feed updated: 2026-06-22T16:15:54Z

http://arxiv.org/abs/2606.09777v2
AetheRock: An Arm-Worn Robot Teaching System for Force-Guided Vision-Tactile Learning
Updated: 2026-06-14T16:11:56Z
Force and tactile sensing are indispensable in contact-rich manipulation. However, force-aware robot learning faces critical challenges due to the incompatible assembly of tactile and force sensors in handheld or wearable devices. To address these limitations, we first introduce AetheRock for gripper-force, vision, and tactile data collection, which is an arm-worn device featuring a modular and easily manufactured visuo-tactile sensor, GelSlim-MiniFab, at the fingertip, a resistive pressure sensor at the human finger contact region, a customized PCB module, and a wearable kit for comfortable and robust collection. Building on this, we propose ForceVT, a representation learning framework that uses force and vision to guide fidelity-agnostic tactile learning, enabling robust inference in any tactile situation. Real-world experiments show that AetheRock achieves qualified data efficiency and that ForceVT effectively alleviates inefficiencies when visuo-tactile sensors exhibit manufacturing and utilization inconsistencies. Overall, our work mitigates the limitations of gripper-force vision-tactile robot learning through innovative hardware design and algorithms.
Published: 2026-06-08T17:38:57Z
Authors: Hong Li, Yue Xu, Yihan Tang, Yankang Dong, Chenyuan Liu, Chenyang Yu, Xuyang Li, Siyuan Huang, Yujun Shen, Nan Xue, Yong-Lu Li

http://arxiv.org/abs/2606.15846v1
FlashNav: Ultra-Fast Policy Training for Robot Navigation within 20 Seconds
Updated: 2026-06-14T15:02:07Z
Deep reinforcement learning has shown strong potential for robot navigation, but its practical deployment is still limited by the long wall-clock cost of policy training. This paper presents FlashNav, a GPU-first framework for ultra-fast range-based robot navigation training.
To the best of our knowledge, FlashNav is the first DRL-based robot navigation framework that reaches seconds-level policy training, with the fastest deployable policy trained in less than 20 seconds. The key idea is to align simulation with the navigation MDP: FlashNav preserves the essential components for velocity-level navigation, including occupancy geometry, range sensing, goal-conditioned control, robot motion dynamics, collision handling, termination, and reset, while removing unnecessary rendering and high-fidelity physical details from the training loop. Built on a batched bitmap simulator and a fully GPU-resident training pipeline with our FastDSAC learner, FlashNav generates massive parallel navigation transitions entirely on GPU. Experiments on TurtleBot2 and Unitree Go2 show that FlashNav achieves a 100% success rate below 20 seconds on an RTX 5090 and remains within tens of seconds across desktop GPUs. The learned policies further transfer to physical wheeled and legged robots in static and dynamic indoor scenes, demonstrating that DRL-based navigation can be trained at seconds-level speed while preserving deployable obstacle-avoidance behavior.
Published: 2026-06-14T15:02:07Z
Comment: 15 pages, 4 figures
Authors: Shanze Wang, Yiwei Qian, Xinming Zhang, Jun Xue, Siwei Cheng, Xianghui Wang, Qingyuan Hu, Xiaoyu Shen, Wei Zhang

http://arxiv.org/abs/2511.15645v4
FDIO: Frequency Decomposed Inertial Odometry
Updated: 2026-06-14T13:37:01Z
Pedestrian inertial odometry (PIO) estimates autonomous pedestrian motion using only acceleration and angular velocity measurements collected by an inertial measurement unit (IMU), making it highly valuable for consumer level localization applications. However, under a dual device acquisition setting, IMU signals collected by a freely carried mobile device are inherently composite signals in which the global motion of the human torso is coupled with perturbations induced by local limb motion. This coupling makes accurate human motion modeling more challenging.
To address this issue, this paper proposes frequency decomposed inertial odometry (FDIO). The proposed method first decomposes input IMU signals into low frequency and high frequency components using a Laplacian pyramid. It then adopts a Mamba module to model long range motion information from the low frequency component and uses a multi scale convolution module to extract fine grained local dynamic features from the high frequency component. Experiments on five public PIO datasets show that FDIO achieves an average absolute trajectory error of 3.221 m and an average relative trajectory error of 2.550 m, reducing the errors by 33.3% and 16.7% compared with the RoNIN ResNet baseline, respectively. These results validate the effectiveness of the proposed frequency decomposition strategy. To the best of our knowledge, this work is among the first efforts to introduce Mamba and a frequency decomposition architecture into inertial odometry.
Published: 2025-11-19T17:29:27Z
Authors: Shanshan Zhang, Liqin Wu, Wenying Cao, Lingxiang Zheng, Yu Yang

http://arxiv.org/abs/2411.19567v3
DynNPC: Finding More Violations Induced by ADS in Simulation Testing through Dynamic NPC Behavior Generation
Updated: 2026-06-14T13:35:45Z
Recently, a number of simulation testing approaches have been proposed to generate diverse driving scenarios for autonomous driving systems (ADSs) testing. However, the behaviors of NPC vehicles in these scenarios generated by previous approaches are predefined and mutated before simulation execution, ignoring traffic signals and the behaviors of the Ego vehicle. Thus, a large number of the violations they found are induced by unrealistic behaviors of NPC vehicles, revealing no bugs of ADSs. Besides, the vast scenario search space of NPC behaviors during the iterative mutations limits the efficiency of previous approaches.
To address these limitations, we propose a novel scenario-based testing framework, DynNPC, to generate more violation scenarios induced by the ADS. Specifically, DynNPC allows NPC vehicles to dynamically generate behaviors using different driving strategies during simulation execution based on traffic signals and the real-time behavior of the Ego vehicle. We compare DynNPC with state-of-the-art scenario-based testing approaches. Our evaluation has demonstrated the effectiveness and efficiency of DynNPC in finding more violation scenarios induced by the ADS.
Published: 2024-11-29T09:28:17Z
Comment: Accepted by TOSEM 2026
Authors: You Lu, Yifan Tian, Dingji Wang, Bihuan Chen, Xin Peng

http://arxiv.org/abs/2509.00836v3
One-Step Model Predictive Path Integral for Manipulator Motion Planning Using Configuration Space Distance Fields
Updated: 2026-06-14T12:23:11Z
Motion planning for robotic manipulators is a fundamental problem in robotics. Classical optimization-based methods typically rely on the gradients of signed distance fields (SDFs) to impose collision-avoidance constraints. However, these methods are susceptible to local minima and may fail when the SDF gradients vanish. Recently, Configuration Space Distance Fields (CDFs) have been introduced, which directly model distances in the robot's configuration space. Unlike workspace SDFs, CDFs are differentiable almost everywhere and thus provide reliable gradient information. On the other hand, gradient-free approaches such as Model Predictive Path Integral (MPPI) control leverage long-horizon rollouts to achieve collision avoidance. While effective, these methods are computationally expensive due to the large number of trajectory samples, repeated collision checks, and the difficulty of designing cost functions with heterogeneous physical units. In this paper, we propose a framework that integrates CDFs with MPPI to enable direct navigation in the robot's configuration space.
Leveraging CDF gradients, we unify the MPPI cost in joint-space and reduce the horizon to one step, substantially cutting computation while preserving collision avoidance in practice. We demonstrate that our approach achieves nearly 100% success rates in 2D environments and consistently high success rates in challenging 7-DOF Franka manipulator simulations with complex obstacles. Furthermore, our method attains control frequencies exceeding 750 Hz, substantially outperforming both optimization-based and standard MPPI baselines. These results highlight the effectiveness and efficiency of the proposed CDF-MPPI framework for high-dimensional motion planning.
Published: 2025-08-31T13:09:24Z
Authors: Yulin Li, Tetsuro Miyazaki, Kenji Kawashima

http://arxiv.org/abs/2606.15768v1
LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies
Updated: 2026-06-14T12:06:58Z
Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control.
LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
Published: 2026-06-14T12:06:58Z
Authors: Jialei Chen, Kai Wang, Kang Chen, Shuaihang Chen, Feng Gao, Wenhao Tang, Zhiyuan Li, Weilin Liu, Zhuyu Yao, Boxun Li, Yuanbo Xu, Chao Yu

http://arxiv.org/abs/2605.27284v2
FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies
Updated: 2026-06-14T11:23:34Z
Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving execution-critical details such as active arm, approach direction, and contact region unspecified. This limits steerable policy learning and robotic video understanding. We introduce FineVLA, an open framework for action-aligned fine-grained VLA supervision. The framework includes: (1) a data construction tool that unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets and builds FineVLA-Data, a human-verified dataset of 47,159 fine-grained trajectories; (2) a held-out benchmark with 500 videos, 11,631 atomic facts, and 1,030 VQA questions; (3) a robotics-specialized VLM annotator for scalable fine-grained annotation; and (4) a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions. Our experiments yield three findings. First, fine-grained supervision does not sacrifice goal-level success: FG-only improves over Raw-only by +1.4 to +8.1 success-rate points across settings. Second, fine-grained and raw instructions are complementary, following a consistent inverted-U trend peaking at FG:Raw = 1:2 to 1:1.
The best mixed setting reaches 86.8%/82.5% in RoboTwin simulation and 62.7/100 in real-world dual-arm manipulation (vs. 49.9 Raw-only). Third, fine-grained supervision improves steerable control: the largest real-world gains appear on pose (+23), color (+18), and approach direction (+18)--factors where goal-level instructions provide no guidance. Overall, fine-grained language should augment goal-level instructions: specifying how to execute alongside what to achieve. Project page: https://finevla.xlang.ai/
Published: 2026-05-26T17:01:10Z
Comment: 26 pages, 7 figures, 25 tables
Authors: Xintong Hu, Xuhong Huang, Jinyu Zhang, Yutong Yao, Yuchong Sun, Qiuyue Wang, Mingsheng Li, Sicheng Xie, Yitao Liu, Junhao Chen, Yixuan Chen, Yingming Zheng, Shuai Bai, Tao Yu

http://arxiv.org/abs/2606.15714v1
Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models
Updated: 2026-06-14T10:04:08Z
Vision-Language-Action models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with English instructions, leaving their ability to understand and execute instructions in other languages largely unexplored. While the underlying large language models often possess multilingual capabilities, it remains unclear whether these multilingual capabilities transfer to VLAs during training. In this work, we present the first systematic study of multilingual instruction following in VLA models. We first construct multilingual instructions by extending existing benchmarks with translations of their instructions. Using these instructions, we evaluate several representative VLA models across a range of tasks in simulation settings. Our experiments reveal a significant multilingual gap: models trained primarily on English instructions exhibit substantial performance degradation when evaluated on other languages, even when the underlying language backbone is multilingual.
We provide several findings and analyses to understand the multilingual gap. Cross-lingual transfer behavior analysis shows that performance drops correlate with both instruction understanding and action execution. Representation analyses suggest that multilingual instruction-caused representation shifts may contribute to the multilingual gap. Motivated by these findings, we further explore strategies to improve multilingual performance in VLAs. We propose a simple yet effective multilingual fine-tuning approach, Multilingual Principal Component Alignment, which leverages Principal Component Analysis to get the principal component subspace and align projected multilingual representations, effectively reducing the multilingual performance gap.
Published: 2026-06-14T10:04:08Z
Authors: Hanyang Chen, Hongliang Li, Jiarui Cao, Yang Li, Yang Jiang, Haonan Wen, Kaiyu Huang, Shengnan Guo, Huaiyu Wan

http://arxiv.org/abs/2511.18960v4
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
Updated: 2026-06-14T09:45:21Z
Vision-Language-Action (VLA) models have shown remarkable progress in embodied tasks recently, but most methods process visual observations independently at each timestep. This history-agnostic design treats robot manipulation as a Markov Decision Process, even though real-world robotic control is inherently partially observable and requires reasoning over past interactions. To address this mismatch, we reformulate VLA policy learning from a Partially Observable Markov Decision Process perspective and propose AVA-VLA, a framework that conditions action generation on a recurrent state that serves as a neural approximation to the agent's belief over task history. Built on this recurrent state, we introduce Active Visual Attention (AVA), which dynamically reweights visual tokens in the current observation to focus on regions most relevant given both the instruction and execution history.
Extensive experiments show that AVA-VLA achieves state-of-the-art performance on standard robotic benchmarks, including LIBERO and CALVIN, and transfers effectively to real-world dual-arm manipulation tasks. These results demonstrate the effectiveness of temporally grounded active visual processing for improving VLA performance in robotic sequential decision-making. The project page is available at https://liauto-dsr.github.io/AVA-VLA-Page.
Published: 2025-11-24T10:22:28Z
Comment: Accepted at CVPR 2026 (Highlight)
Authors: Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin, Jingjing Qian, Jing Zhang, Yong Wu, Xiaoyuan Yu

http://arxiv.org/abs/2606.15691v1
Can Causal Models Enhance Robot Navigation? Online Causal Adaptation for Real-Robot Navigation
Updated: 2026-06-14T09:11:01Z
Causality in robotics aims to produce more interpretable and flexible robot behaviours by enabling robots to predict the consequences of their actions; however, deploying causal models with existing systems (e.g., navigation) operating in real environments remains understudied. This paper addresses the challenging problem of transferring causal models in real-robot experiments for a navigation scenario. We study this problem in two ways: (i) using the causal model as an offline evaluation module that predicts the competence of recorded real-robot navigation trajectories and relates it to quantitative navigation performance, and (ii) using the causal model as an online adaptation module that intervenes when the predicted competence of the default navigation is low. We validate our approach in a physical service robot that patrols around corridors. We show that the predicted competence correlates positively with path efficiency, and negatively with path irregularities (suboptimal behaviour). The model predictions also show strong agreement with human annotations (Cohen's kappa value of 0.88).
In online experiments, the proposed method improves navigation performance in complex scenarios such as cornering and obstacle avoidance, yielding higher predicted competence and better navigation metrics than the default navigation baseline. In simpler scenarios, where the baseline already performs near-optimally, the causal adaptation provides limited benefit. These results indicate that causal models are particularly effective in enhancing navigation under increased task complexity. Overall, our results demonstrate that causal models developed for behavioural interpretation can be successfully integrated into real-robot navigation systems.
Published: 2026-06-14T09:11:01Z
Comment: Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
Authors: Zhitao Liang, Alex Mitrevski, Emmanuel Dean, Karinne Ramirez-Amaro

http://arxiv.org/abs/2606.15685v1
Learning New Tasks via Reusable Skills: Skill-Compositional Experts for Embodied Continual Learning
Updated: 2026-06-14T09:02:49Z
Embodied Continual Learning (ECL) aims to enable robots to continually acquire new manipulation tasks while retaining previously learned behaviors under closed-loop control. Compared with conventional continual learning, ECL suffers from more severe catastrophic forgetting. Feature drift accumulated under closed-loop control progressively propagates through sequential decision-making, leading to degradation of previously learned behaviors. A key challenge in ECL lies in structured skill reuse across continually evolving tasks, since existing methods primarily focus on skill learning without explicitly organizing them for coherent task execution. To address this issue, we propose SCE, a Skill-Compositional Experts framework for ECL. SCE builds a skill base via Compositional Skill Grounding (CSG), which decomposes task demonstrations into reusable skills.
Based on this, Dual Execution-and-Transition Experts (DETE) enable new task learning through skill composition, where one branch ensures skill execution and the other supports transitions between skills for coherent behavior. Experiments on LIBERO benchmarks and real-world manipulation tasks demonstrate that SCE consistently improves retention and overall task performance. Further feature drift analyses and ablation studies verify the effectiveness of our method. Project website: https://eqcy.github.io/sce/.
Published: 2026-06-14T09:02:49Z
Comment: 13 pages, 5 figures
Authors: Shuaike Zhang, Shaokun Wang, Haoyu Tang, Jianlong Wu, Liqiang Nie

http://arxiv.org/abs/2606.15654v1
PO-PDDL: Learning Symbolic POMDPs from Visual Demonstrations for Robot Planning Under Uncertainty
Updated: 2026-06-14T07:46:28Z
Real-world robot task planning must operate under both stochastic action execution and partial observability, yet constructing Partially Observable Markov Decision Process (POMDP) models for real robotics domains remains difficult and labor-intensive. We introduce PO-PDDL, a symbolic formulation of POMDPs that preserves the relational structure and LLM-friendly syntax of the Planning Domain Definition Language (PDDL), while explicitly modeling partial observability, stochasticity, and beliefs. Building on this formulation, we propose a demonstration-driven pipeline for learning PO-PDDL models. The proposed method reconstructs latent symbolic state trajectories from real-robot execution videos, identifies partial observability via inconsistencies between inferred states and visual observations, and learns stochastic transition and observation models accordingly. The resulting PO-PDDL domains are reusable across tasks and enable online belief-space planning under both perception and execution uncertainty.
Experiments on real-world long-horizon manipulation tasks show that our method consistently outperforms existing PDDL and POMDP model-learning approaches, achieving robust task planning under uncertainty with significantly lower planning cost.
Published: 2026-06-14T07:46:28Z
Authors: Wenjing Tang, Xuanjin Jin, Yuan Liu, Renming Huang, Cewu Lu, Panpan Cai

http://arxiv.org/abs/2606.15645v1
TO-SoFiT: Topology Optimization of Hydraulic Soft Fish Tail Design for programmable undulating locomotion
Updated: 2026-06-14T07:22:00Z
Soft robots leverage compliant materials to generate motion through controlled elastic deformation, making them ideal for delicate tasks such as underwater exploration and biomimetic marine systems. Although hydraulic/pneumatic actuation remains pivotal for such systems, the lack of systematic design frameworks has hindered the development of robots capable of complex 3D motion, such as fish-like swimming. This work introduces a topology optimization method to automate the design of a hydraulic soft fish tail, explicitly addressing the design-dependent coupling between fluidic actuation and structural deformation. We use a Darcy law-based model augmented with a drainage term to simulate spatially varying hydraulic pressure loads, translating these into consistent nodal forces via finite element analysis. The employed robust multi-criteria optimization formulation balances deformation efficiency, fluid-structure interaction, geometric manufacturability, and required stiffness for optimizing a bioinspired soft fish tail for 3D swimming kinematics. The optimized tail topology is incorporated into a pneumatic network actuator and computationally validated under various hydraulic loads, achieving tunable undulatory amplitudes and multiaxis bending for depth adjustment. The optimized 2D tail outperforms its rectangular counterpart. By cascading optimized tail segments, we demonstrate programmable swimming patterns in soft robotic fish tails at different hydraulic loads.
This work advances the systematic codesign of hydraulic actuators and soft structures, offering a pathway to automate underwater robots with optimized design and vertebrate-like agility in confined aquatic environments. Our implementations and simulations are publicly available at 'https://github.com/PrabhatIn/TO-SoFiT'.
Published: 2026-06-14T07:22:00Z
Comment: Accepted for publication at the Advances in Robotics (AIR), 2025, IIT Jodhpur
Authors: A Padmaprabhan, Amal Shaji, Prabhat Kumar

http://arxiv.org/abs/2606.15631v1
Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time
Updated: 2026-06-14T06:48:01Z
Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions.
On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.
Published: 2026-06-14T06:48:01Z
Comment: https://recap-robot.github.io/
Authors: Jeongeun Park, Juhan Park, Taekyung Kim, Sungjoon Choi, Dongyoon Han, Sangdoo Yun

http://arxiv.org/abs/2606.15594v1
Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC
Updated: 2026-06-14T04:34:03Z
We present SLS^2, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action-conditioned joint-embedding world model with compact Markovian latent states, enabling efficient gradient-based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU-accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent-space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed-loop execution. We evaluate our method on vision-based control tasks, where it improves both goal-reaching performance and safety over latent world-model and safe-planning baselines.
Published: 2026-06-14T04:34:03Z
Authors: Devesh Nath, Anutam Srinivasan, Haoran Yin, Ruitong Jiang, Jeffrey Fang, Glen Chou
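
The last abstract above uses conformal prediction to obtain calibrated error bounds. What follows is a minimal sketch of the generic split conformal recipe only, not the authors' implementation: the function name `conformal_error_bound`, the synthetic calibration errors, and the miscoverage level `alpha` are all illustrative assumptions. Given n held-out error magnitudes, the bound is the ceil((n + 1)(1 - alpha))-th smallest one, which covers a fresh exchangeable error with probability at least 1 - alpha.

```python
import math
import random


def conformal_error_bound(calibration_errors, alpha):
    """Split conformal bound: a value q such that a fresh, exchangeable
    error is <= q with probability at least 1 - alpha."""
    n = len(calibration_errors)
    if n == 0:
        raise ValueError("need at least one calibration error")
    # 1-indexed rank of the order statistic used as the bound.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: only the trivial bound holds.
        return math.inf
    return sorted(calibration_errors)[rank - 1]


# Hypothetical calibration set standing in for prediction-error norms.
rng = random.Random(0)
errors = [abs(rng.gauss(0.0, 1.0)) for _ in range(199)]
bound = conformal_error_bound(errors, alpha=0.1)
```

In a robust-control setting, a bound of this kind could be used to tighten constraints by the calibrated error margin; how SLS^2 does this is not specified in the abstract.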