https://arxiv.org/api/XyJ8JostpeYK01ApOtBprRUxVto
Updated: 2026-06-22T08:56:35Z

http://arxiv.org/abs/2606.16935v1
CrossMaps: Confidence-Aware Open-Vocabulary Semantic Mapping for Rover Navigation
Updated: 2026-06-15T16:35:01Z
Rovers rely on perception to maintain spatial maps that encode both objects and sensor quality (e.g., range reliability, lighting artifacts, data density), guiding data fusion, embedding updates, and navigation under partial observability. To study these coupled perception-navigation processes, we present CrossMaps, a real-time confidence-aware open-vocabulary semantic mapping pipeline that constructs language-queryable maps from RGB-D data. Building on VLMaps-style approaches, CrossMaps integrates multi-scale CLIP embeddings with confidence-aware fusion and a dual-memory architecture consisting of Short-Term Memory (STM) and Long-Term Memory (LTM). The STM aggregates noisy visual observations using geometric, semantic, and temporal confidence cues, while confident and coherent cells are promoted to the LTM as persistent semantic landmarks. Designed for deployment with a Jetson Orin-powered UGV alongside SLAM, CrossMaps runs in real time and produces semantic heatmaps that can be queried with natural language to guide rover navigation.
Published: 2026-06-15T16:35:01Z
Comment: IEEE International Conference on Robotics and Automation (ICRA) 2026: ROSE International Workshop on Robotics Software Engineering, June 01, 2026, Vienna, Austria
Authors: Jan-Niklas Klein, Sona Ghahremani, Christian Medeiros Adriano, Holger Giese

http://arxiv.org/abs/2502.19544v3
Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data
Updated: 2026-06-15T16:24:19Z
Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments.
Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two techniques: (i) experience rehearsal and (ii) execution guidance. With these modifications, the non-curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves nearly twice the aggregate score of learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.
Published: 2025-02-26T20:34:29Z
Authors: Yi Zhao, Aidan Scannell, Wenshuai Zhao, Yuxin Hou, Tianyu Cui, Le Chen, Dieter Büchler, Arno Solin, Juho Kannala, Joni Pajarinen

http://arxiv.org/abs/2606.16902v1
Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models
Updated: 2026-06-15T16:10:03Z
This work addresses spatial question answering for service robots traversing long egocentric routes. Given a query such as "where can I find a dry cleaner on the way back home?", the system returns a metric coordinate that downstream navigation components can act on. Prior Spatial Question Answering approaches leverage retrieval-augmented agents built on closed-source models such as GPT-4o for path exploration. However, robots operating in the real world often cannot reliably depend on online closed-source models due to network instability, communication latency, and deployment cost. This creates a need for open-source Spatial Question Answering approaches that can run onboard the robot, yet prior research in this direction remains limited.
This work proposes BinTrack, a simple yet effective, fully open-source spatial-localization agent that leverages the temporal ordering of a robot's trajectory. BinTrack performs a binary search over the trajectory segments between two anchor landmarks identified from a query. It improves overall accuracy by up to 22.8% over other open-source implementations and even matches the reported closed-source model result on the global category of the SpaceLocQA benchmark, the most challenging setting that has so far required strong reasoning agents such as GPT-4o. Furthermore, its optimized inference strategy consistently yields more than a 1.5x inference speedup over previous approaches. Finally, this work releases GangnamLoop, a novel and practical multi-trip outdoor benchmark collected by deploying a real quadruped robot on public streets under an anonymization policy. It revisits the same locations under different outdoor conditions and pairs the robot's low viewpoint with the human owner's. The source code and datasets are publicly available at https://github.com/ndb796/BinaryTracking
Published: 2026-06-15T16:10:03Z
Comment: 21 pages, 4 figures, 15 tables. Project page: https://ndb796.github.io/BinaryTracking ; Code and dataset: https://github.com/ndb796/BinaryTracking
Authors: Dongbin Na, Chanwoo Kim, Soonbin Rho, Giyun Choi, Gangbok Lee, Dooyoung Hong

http://arxiv.org/abs/2606.16888v1
LOPAL: Local Performance-Aware Active Learning from Imperfect Demonstrations
Updated: 2026-06-15T16:00:01Z
Learning from Demonstration (LfD) enables intuitive robot skill acquisition by allowing robots to learn directly from human task demonstrations. However, current methods often fail to address the fact that, due to suboptimal and inconsistent human behavior, quality can vary within each demonstration. Therefore, we introduce LOPAL (LOcal Performance-aware Active Learning), an active learning approach that leverages this local demonstration quality information.
Our approach consists of two synergistic components. First, a local performance-driven LfD method uses a Gaussian Mixture Model (GMM) to encode both the demonstrated trajectories and their associated local quality assessments. This enables the generation of trajectories that outperform the imperfect demonstrations by utilizing complementary local data of high performance. Second, active data acquisition enables improvement beyond the imperfect demonstrations by collecting additional informative samples. In areas missing good data, the user is actively requested to provide corrections through a shared autonomy (SA) mechanism, while the robot autonomously executes the learned behavior. The efficacy of LOPAL was validated in both a simulation and a real-world experiment. The results from a real-world pipe inspection task showed that the proposed approach can achieve up to 27.31% improvement in task performance while also reducing the effort required to collect the demonstrations.
Published: 2026-06-15T16:00:01Z
Comment: Accepted for publication in IEEE Robotics and Automation Letters (RAL), 2026
Authors: Johannes Heidersberger, Shail Jadav, Dongheui Lee
DOI: 10.1109/LRA.2026.3698364

http://arxiv.org/abs/2606.16881v1
SGM-SLAM: Scene Graph Matching for Data-Efficient Distributed SLAM
Updated: 2026-06-15T15:53:59Z
We introduce a data-efficient distributed Simultaneous Localization and Mapping (SLAM) framework designed for a team of robots equipped with LiDAR, cameras, and inertial sensors. Our framework uses scene graph matching to identify inter-robot measurement constraints. Unlike prior approaches that rely on feature-level matching, our framework is the first to perform scene graph matching using only object labels and centroids. Our approach constructs a scene graph by using fused RGB-LiDAR point clouds to generate both a semantically segmented point cloud layer and a layer of discrete bounded objects to accompany estimated robot trajectories.
Scene graph matching is performed collaboratively through exchanging and matching object data with neighboring robots. To maximize communication efficiency, we utilize a multi-step data exchange and optimization process. We demonstrate the effectiveness and efficiency of our approach using both simulation and real-world datasets collected by legged robots in indoor and outdoor environments.
Published: 2026-06-15T15:53:59Z
Authors: Yewei Huang, Tixiao Shan, Abhinav Rajvanshi, Niluthpol Chowdhury Mithun, Yaxuan Li, Brendan Englot, Han-Pang Chiu

http://arxiv.org/abs/2601.08056v3
The embodied brain: Bridging the brain, body, and behavior with biorealistic neuromechanical models
Updated: 2026-06-15T15:51:33Z
Animal behavior reflects interactions between the nervous system, body, and environment. Therefore, biomechanics and environmental context must be considered to understand algorithms for behavioral control. Computational models that embed artificial neural controllers within body models in simulated environments are a powerful tool for this purpose. Here, we review advances in biorealistic neuromechanical models while also highlighting emerging opportunities ahead. We first show how these models enable inference of biophysical variables that are difficult to measure experimentally. Through systematic perturbation of these models, one can generate new experimentally testable hypotheses. We then examine how neuromechanical models facilitate the exchange between neuroscience, robotics, and machine learning, and showcase their applications in healthcare.
We envision that coupling experimental studies with active probing of their neuromechanical surrogates will significantly accelerate progress in neuroscience.
Published: 2026-01-12T22:57:45Z
Comment: 18 pages, 4 figures (including 1 graphical abstract), 1 table
Authors: Sibo Wang-Chen, Pavan Ramdya

http://arxiv.org/abs/2606.16876v1
ExoTraj: A General Lower-limb Exoskeleton Assistance Policy for Complex Environments
Updated: 2026-06-15T15:50:29Z
Adaptive torque prediction in dynamic exoskeleton scenarios requires expensive motion capture systems, which are infeasible in complex outdoor environments. Trajectory prediction has emerged as one of the effective approaches to address this issue. However, the core challenges of exoskeleton trajectory prediction are twofold: establishing the mapping from multi-modal features to trajectory information, and constructing the mapping from trajectory to torque. For the former, most existing methods perform only single-step prediction and neglect inter-subject trajectory variability, thereby limiting the trajectory optimization space and prediction generalization. To address this, this paper proposes a fast flow matching method that enables accurate trajectory prediction and better generalization for real-time performance, where trajectory generation errors and encoded observations are used to guide the training direction. For the second challenge, due to the high dynamics of the human-robot system and the strong coupling between perception and control, simple control methods struggle to achieve efficient assistance based on the predicted trajectory. This paper utilizes model predictive control and designs a novel optimization objective to optimize torque, ensuring the exoskeleton achieves comfortable and robust assistance. By integrating the above two components, the unified policy, denoted as ExoTraj, is developed to enable adaptive assistance in complex outdoor scenarios without high data acquisition cost.
Experimental results show that compared to traditional methods, ExoTraj reduces cross-subject prediction error by 14.0% during the online phase and maintains robustness against external noise. Relative to the zero torque condition, ExoTraj decreases metabolic rate by 11.5-24.4%, heart rate by 1.7-19.5%, and peak muscle activation levels by 10.9-41.3%, respectively.
Published: 2026-06-15T15:50:29Z
Comment: 28 pages, 19 figures, project page: https://xiaoyinliu0714.github.io/Home_ExoTraj/
Authors: Xiao-Yin Liu, Guotao Li, Long Sun, Xu Liang, Zeng-Guang Hou

http://arxiv.org/abs/2408.15919v4
ReMoBot: Retrieval-Based Few-Shot Imitation Learning for Mobile Manipulation with Vision Foundation Models
Updated: 2026-06-15T15:44:08Z
Imitation learning (IL) algorithms typically distill demonstrations into parametric policies to mimic expert behavior. However, with limited data and partial observability, such as in egocentric mobile manipulation, existing methods often struggle to generate accurate actions. To address these challenges, we propose ReMoBot, a few-shot, trajectory-conditioned imitation learning framework that directly Retrieves information from demonstrations to solve Mobile manipulation tasks with ego-centric visual observations. Leveraging vision foundation models, ReMoBot identifies relevant expert demonstrations by combining state-level similarity, history-aware trajectory alignment, and action-sequence consistency to disambiguate perceptually similar observations. The agent then selects appropriate control commands based on these retrieved demonstrations in a fully training-free manner.
We evaluate ReMoBot on three mobile manipulation tasks using a Boston Dynamics Spot robot in both simulation and real-world settings. After benchmarking five approaches in simulation, we compare our method with two baselines trained directly on real-world data without sim-to-real transfer. With only 20 demonstrations per task, ReMoBot outperforms the baselines, achieving high success rates in Table Uncover (70%) and Gap Cover (80%), while also showing promising performance on the more challenging Curtain Open task in the real-world setting. Furthermore, ReMoBot generalizes across varying robot positions, object sizes, and material properties, highlighting its robustness in real-world deformable mobile manipulation. Additional details are available at: https://sites.google.com/view/remobot/home
Published: 2024-08-28T16:33:21Z
Authors: Yuying Zhang, Wenyan Yang, Francesco Verdoja, Ville Kyrki, Joni Pajarinen

http://arxiv.org/abs/2606.16856v1
Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning
Updated: 2026-06-15T15:31:40Z
Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL (PbRL) offers a promising alternative by learning reward functions from human feedback, but its scalability is hindered by high labeling costs. Inspired by advances in Video Foundation Models (ViFMs), we present Video-based Optimal Transport Preference (VOTP), a semi-supervised framework that learns effective reward functions from only a handful of labels. By leveraging optimal transport to align visual trajectories within the rich representation space of ViFMs, VOTP effectively generates high-fidelity pseudo-labels for large amounts of unlabeled data, substantially reducing human supervision. Extensive experiments across locomotion and manipulation benchmarks demonstrate the superiority of VOTP, which outperforms state-of-the-art offline PbRL methods under limited feedback budgets.
We also showcase the robustness of VOTP in the presence of visual distractors and validate its utility on real robotic tasks, where it learns meaningful rewards with minimal human input.
Published: 2026-06-15T15:31:40Z
Comment: ICML 2026 (Oral)
Authors: Tung M. Luu, Hwanhee Kim, Younghwan Lee, Chang D. Yoo

http://arxiv.org/abs/2510.07063v3
Artists' Views on Robotics Involvement in Painting Productions
Updated: 2026-06-15T15:29:13Z
As robotic technologies evolve, their potential in artistic creation becomes an increasingly relevant topic of inquiry. This study explores how professional abstract artists perceive and experience co-creative interactions with an autonomous painting robotic arm. Eight artists engaged in six painting sessions -- three with a human partner, followed by three with the robot -- and subsequently participated in semi-structured interviews analyzed through reflexive thematic analysis. Human-human interactions were described as intuitive, dialogic, and emotionally engaging, whereas human-robot sessions felt more playful and reflective, offering greater autonomy and prompting novel strategies to overcome the system's limitations. This work offers one of the first empirical investigations into artists' lived experiences with a robot, highlighting the value of long-term engagement and a multidisciplinary approach to human-robot co-creation.
Published: 2025-10-08T14:28:30Z
Comment: 10 pages, 9 figures, submitted to RAM special issue: Arts and Robotics
Authors: Francesca Cocchella, Nilay Roy Choudhury, Eric Chen, Patrícia Alves-Oliveira

http://arxiv.org/abs/2606.08059v2
Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain
Updated: 2026-06-15T15:19:43Z
Humanoid behavior foundation models aim to acquire reusable whole-body control policies from broad human motion priors, enabling a single controller to produce diverse and expressive behaviors. However, existing motion-centric foundation policies largely assume that the reference motion is already physically compatible with the robot's surroundings.
This assumption breaks when the demonstrator, operator, and robot inhabit different environments: a human motion may specify the intended behavior, but not the footholds, clearance, body height, or contact timing required by the robot's local terrain. We introduce Perceptive Behavior Foundation Model (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. The model preserves raw kinematic motion references as the behavioral interface, while using local terrain observations to adapt contacts, posture, and timing. To provide scalable terrain supervision, we develop terrain-conformal reference synthesis (TCRS), which converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. We then train a blind adapted-reference teacher and transfer its terrain-conformal behavior to a deployed raw-reference student through target-frame action alignment. The student is an identity-gated Transformer tracker whose terrain features enter through residual pathways initialized to preserve the motion-tracking prior and trained to produce local corrections only when needed.
Published: 2026-06-06T08:46:44Z
Authors: Zifan Wang, Yizhao Li, Teli Ma, Qiang Zhang, Yudong Fan, Hao Xu, Shuo Yang, Junwei Liang

http://arxiv.org/abs/2601.08514v2
Simplifying ROS2 controllers with a modular architecture for robot-agnostic reference generation
Updated: 2026-06-15T15:17:01Z
This paper introduces a novel modular architecture for ROS2 that decouples the logic required to acquire, validate, and interpolate references from the control laws that track them.
The design includes a dedicated component, named Reference Generator, that receives references, in the form of either single points or trajectories, from external nodes (e.g., planners), and writes single-point references at the controller's sampling period via the existing ros2_control chaining mechanism to downstream controllers. This separation removes duplicated reference-handling code from controllers and improves reusability across robot platforms. We implement two reference generators, one for handling joint-space references and one for Cartesian references, along with a set of new controllers (PD with gravity compensation, Cartesian pose, and admittance controllers), and validate the approach on simulated and real Universal Robots and Franka Emika manipulators. Results show that (i) references are tracked reliably in all tested scenarios, (ii) reference generators reduce duplicated reference-handling code across chained controllers, which favors the construction and reuse of complex controller pipelines, and (iii) controller implementations remain focused only on control laws.
Published: 2026-01-13T12:55:07Z
Comment: 5 pages, 7 figures
Authors: Davide Risi, Vincenzo Petrone, Antonio Langella, Lorenzo Pagliara, Enrico Ferrentino, Pasquale Chiacchio

http://arxiv.org/abs/2606.16826v1
ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies
Updated: 2026-06-15T15:08:42Z
Generalist manipulation policies are increasingly presented as foundation models for robotic control, but their real-world generalization remains difficult to diagnose. A policy may succeed on demonstrated tasks while still failing to execute fine-grained atomic skills or recombine learned skills in new task structures. We introduce ATOM-Bench, a real-world benchmark for evaluating both atomic skills and compositional generalization in manipulation policies.
ATOM-Bench factorizes tabletop manipulation into motor atoms and instruction atoms, and contains 30 atomic tasks and 24 held-out compositional tasks across paired single-arm and dual-arm robot tracks. We collect 3,000 human demonstrations for atomic fine-tuning and release both the demonstration data and evaluation rollout data to support reproducible real-world evaluation. Policies are fine-tuned on atomic tasks and evaluated on both atomic skill acquisition and held-out compositional tasks. We further introduce Atomic Score (AS) and Compositional Failure Share (CFS) to distinguish failures caused by weak atomic skills from failures caused by limited compositional reuse. Through 2,700 physical rollouts on five representative manipulation policies, we find that current policies can acquire simple instruction-grounding skills, but still struggle with fine-grained motor atoms, counting, and logical filtering. More importantly, strong atomic performance does not reliably transfer to held-out compositional tasks. ATOM-Bench provides a diagnostic testbed for studying whether failures arise from weak motor execution, poor instruction grounding, or limited compositional reuse.
Published: 2026-06-15T15:08:42Z
Comment: Homepage: https://flageval-baai.github.io/AtomBenchPage
Authors: Zenan Wu, Bingqing Wei, Lu Liu, Zheqi He, Xi Wang, Jiakang Liu, Zehui Li, Guocai Yao, Jing-Shu Zheng, Xi Yang, Yongtao Wang

http://arxiv.org/abs/2606.16788v1
SoK: Security and Privacy of Foundation-Model-Powered Robots
Updated: 2026-06-15T14:32:08Z
Foundation models (FMs) are reshaping robotics by enabling robots to interpret open-ended instructions, reason over multimodal contexts, and operate in complex, open-world environments. However, their integration also introduces security and privacy (S&P) risks that extend beyond the FMs themselves to embodied execution pipelines, supporting ecosystems, and broader governance impacts.
Existing literature reviews provide valuable insights but often focus on specific FM types, risk categories, mitigation strategies, or trust boundaries. Consequently, the field lacks a unified structure for analyzing where risks originate, how they propagate across robotic systems, and where mitigations should intervene. To address this gap, we propose a progressive F-E-S-G structural boundary framework for analyzing the S&P of FM-powered robots. The framework comprises four layers: the Foundation model layer (F), Embodied system layer (E), Supporting ecosystem layer (S), and Governance impact layer (G). Building on this structure, we develop a multi-level taxonomy that organizes prior studies along three levels: F-E-S-G trust boundary, security-privacy concerns, and risk-mitigation perspectives. We further annotate each study using fine-grained coding attributes, including target, lifecycle stage, mechanism, system access, and effect. Guided by this framework and taxonomy, we systematize 96 papers. Our analysis uncovers multiple threat patterns, defense mismatches, and evaluation gaps that are difficult to identify from a single-boundary perspective. Based on these findings, we identify open challenges and future directions to provide a research agenda for developing secure, privacy-preserving, and responsibly governed FM-powered robotic systems.
Published: 2026-06-15T14:32:08Z
Comment: 21 pages, 2 figures
Authors: Xueluan Gong, Chen Chen, Jinxin Liu, Qian Wang, Kwok-Yan Lam

http://arxiv.org/abs/2606.16776v1
DataLadder: A Simulation-Enabled Interconversion Toolchain for the Embodied Data Pyramid
Updated: 2026-06-15T14:21:35Z
Generalist robot policies require trustworthy evaluation and robot-usable training data, but both are difficult to scale with physical robots alone. Real-robot trials and demonstrations remain the most faithful source of deployment signals, yet they are slow, costly, and hard to reproduce.
We present DataLadder, a simulation-enabled interconversion toolchain for human-robot aligned model evaluation and data generation, denoted as Robot $\rightleftharpoons$ Simulation $\rightleftharpoons$ Human. On the one hand, the Robot $\rightarrow$ Simulation $\rightarrow$ Human pathway supports human-robot aligned model evaluation by reconstructing real-robot tabletop organization tasks as calibrated digital twins for scalable evaluation, while using human embodied feedback to inspect and refine the naturalness of simulated motions. On the other hand, the Human $\rightarrow$ Simulation $\rightarrow$ Robot pathway supports human-robot aligned data generation: it lifts ego-centric human demonstrations into simulation, checks them under robot physical constraints, and converts them into robot-centered trajectories, annotations, and visual observations. Together, these pathways use the JoySim simulator as both a scalable evaluation layer and a physical consistency filter for robot data generation. We further package the core reconstruction, simulation, rendering, and realism-augmentation modules as cloud services on JD Cloud, turning the system into reusable infrastructure for robot data generation and model evaluation.
Published: 2026-06-15T14:21:35Z
Comment: Project Page: https://joyai-sim.github.io/
Authors: Peidong Liu, Yongce Liu, Songyan Guo, Fuyuan Ma, Zhihao Yuan, Ao Li, Zengjue Chen, Wenhao Li, Tianle Zhang, Mingyang Li, Jiale Zhang, Junzhe Xiong, Zhiyuan Xiang, Dafeng Chi, Yuzheng Zhuang, Yihang Li, Qingrong He, Jiaming Liang, Chen Cai, Peng Hao, Mingxi Luo, Song Wang, Junwu Xiong, Ruodai Li, Liyi Luo, Wei Tan, Dongjiang Li, Jiawei Li, Hui Shen, Yicheng Gong, Liang Lin