https://arxiv.org/api/ua4BgW92XZPFpei5jt4W2xm8lGs
Updated: 2026-03-26T08:18:14Z

http://arxiv.org/abs/2410.06819v3
Dynamic Neural Potential Field: Online Trajectory Optimization in the Presence of Moving Obstacles
Updated: 2026-03-24T20:26:37Z
Generalist robot policies must operate safely and reliably in everyday human environments such as homes, offices, and warehouses, where people and objects move unpredictably. We present Dynamic Neural Potential Field (NPField-GPT), a learning-enhanced model predictive control (MPC) framework that couples classical optimization with a Transformer-based predictor of footprint-aware repulsive potentials. Given an occupancy sub-map, robot footprint, and optional dynamic-obstacle cues, our NPField-GPT model forecasts a horizon of differentiable potentials that are injected into a sequential quadratic MPC program via L4CasADi, yielding real-time, constraint-aware trajectory optimization. We additionally study two baselines: NPField-StaticMLP, where a dynamic scene is treated as a sequence of static maps; and NPField-DynamicMLP, which predicts the future potential sequence in parallel with an MLP. In dynamic indoor scenarios from BenchMR and on a Husky UGV in office corridors, NPField-GPT produces more efficient and safer trajectories under motion changes, while StaticMLP/DynamicMLP offer lower latency. We also compare with the CIAO* and MPPI baselines. Across methods, the Transformer+MPC synergy preserves the transparency and stability of model-based planning while learning only the part that benefits from data: spatiotemporal collision risk.
Code and trained models are available at https://github.com/CognitiveAISystems/Dynamic-Neural-Potential-Field
Published: 2024-10-09T12:27:09Z
Authors: Aleksei Staroverov, Muhammad Alhaddad, Aditya Narendra, Konstantin Mironov, Aleksandr Panov

http://arxiv.org/abs/2603.23690v1
ROSCell: A ROS2-Based Framework for Automated Formation and Orchestration of Multi-Robot Systems
Updated: 2026-03-24T19:56:32Z
Modern manufacturing under High-Mix-Low-Volume requirements increasingly relies on flexible and adaptive matrix production systems, which depend on interconnected heterogeneous devices and rapid task reconfiguration. To address these needs, we present ROSCell, a ROS2-based framework that enables the flexible formation and management of a computing continuum across various devices. ROSCell allows users to package existing robotic software as deployable skills and, with simple requests, assemble isolated cells, automatically deploy skill instances, and coordinate their communication to meet task objectives. It provides a scalable and low-overhead foundation for adaptive multi-robot computing in dynamic production environments. Experimental results show that, in the idle state, ROSCell substantially reduces CPU, memory, and network overhead compared to K3s-based solutions on edge devices, highlighting its energy efficiency and cost-effectiveness for large-scale deployment in production settings. The source code, examples, and documentation will be provided on GitHub.
Published: 2026-03-24T19:56:32Z
Authors: Jiangtao Shuai, Marvin Carl May, Sonja Schimmler, Manfred Hauswirth

http://arxiv.org/abs/2603.12665v2
TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation
Updated: 2026-03-24T19:50:43Z
Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance in tasks involving visual occlusion, fine-grained manipulation, and physical contact.
To address these challenges, we propose TacVLA, a fine-tuned VLA model that incorporates tactile modalities into the transformer-based policy to enhance fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking, and robustness evaluations demonstrate that our model outperforms baselines, improving success rate by an average of 20% in disassembly and 60% in in-box picking, with a 2.1x improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.
Published: 2026-03-13T05:20:41Z
Comment: 9 pages, 7 figures
Authors: Kaidi Zhang, Heng Zhang, Zhengtong Xu, Zhiyuan Zhang, Md Rakibul Islam Prince, Xiang Li, Xiaojing Han, Yuhao Zhou, Arash Ajoudani, Yu She

http://arxiv.org/abs/2603.23679v1
Learning What Can Be Picked: Active Reachability Estimation for Efficient Robotic Fruit Harvesting
Updated: 2026-03-24T19:35:22Z
Agriculture remains a cornerstone of global health and economic sustainability, yet labor-intensive tasks such as harvesting high-value crops continue to face growing workforce shortages. Robotic harvesting systems offer a promising solution; however, their deployment in unstructured orchard environments is constrained by inefficient perception-to-action pipelines. In particular, existing approaches often rely on exhaustive inverse kinematics or motion planning to determine whether a target fruit is reachable, leading to unnecessary computation and delayed decision-making. Our approach combines RGB-D perception with active learning to directly learn reachability as a binary decision problem.
We then leverage active learning to selectively query the most informative samples for reachability labeling, significantly reducing annotation effort while maintaining high predictive accuracy. Extensive experiments demonstrate that the proposed framework achieves accurate reachability prediction with substantially fewer labeled samples, yielding approximately 6-8% higher accuracy than random sampling and enabling label-efficient adaptation to new orchard configurations. Among the evaluated strategies, entropy- and margin-based sampling outperform Query-by-Committee and standard uncertainty sampling in low-label regimes, while all strategies converge to comparable performance as the labeled set grows. These results highlight the effectiveness of active learning for task-level perception in agricultural robotics and position our approach as a scalable alternative to computation-heavy kinematic reachability analysis. Our code is available through https://github.com/wsu-cyber-security-lab-ai/active-learning.
Published: 2026-03-24T19:35:22Z
Authors: Nur Afsa Syeda, Mohamed Elmahallawy, Luis Fernando de la Torre, John Miller

http://arxiv.org/abs/2603.23676v1
Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement
Updated: 2026-03-24T19:31:13Z
We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both approaches struggle with reasoning over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities.
We extend existing 3D grounding models and propose Reactive Action Mask Planner (RAMP-3D), which formulates long-horizon planning as sequential reactive prediction of paired 3D masks: a "which-object" mask indicating what to pick and a "which-target-region" mask specifying where to place it. The resulting system processes RGB-D observations and natural-language task specifications to reactively generate multi-step pick-and-place actions for 3D box rearrangement. We conduct experiments across 11 task variants in warehouse-style environments with 1-30 boxes and diverse natural-language constraints. RAMP-3D achieves 79.5% success rate on long-horizon rearrangement tasks and significantly outperforms 2D VLM-based baselines, establishing mask-based reactive policies as a promising alternative to symbolic pipelines for long-horizon planning.
Published: 2026-03-24T19:31:13Z
Authors: Ashish Malik, Caleb Lowe, Aayam Shrestha, Stefan Lee, Fuxin Li, Alan Fern

http://arxiv.org/abs/2509.16136v4
Reward Evolution with Graph-of-Thoughts: A Bi-Level Language Model Framework for Reinforcement Learning
Updated: 2026-03-24T19:30:19Z
Designing effective reward functions remains a major challenge in reinforcement learning (RL), often requiring considerable human expertise and iterative refinement. Recent advances leverage Large Language Models (LLMs) for automated reward design, but these approaches are limited by hallucinations, reliance on human feedback, and challenges with handling complex, multi-step tasks. In this work, we introduce Reward Evolution with Graph-of-Thoughts (RE-GoT), a novel bi-level framework that enhances LLMs with structured graph-based reasoning and integrates Visual Language Models (VLMs) for automated rollout evaluation. RE-GoT first decomposes tasks into text-attributed graphs, enabling comprehensive analysis and reward function generation, and then iteratively refines rewards using visual feedback from VLMs without human intervention.
Extensive experiments on 10 RoboGen and 4 ManiSkill2 tasks demonstrate that RE-GoT consistently outperforms existing LLM-based baselines. On RoboGen, our method improves average task success rates by 32.25%, with notable gains on complex multi-step tasks. On ManiSkill2, RE-GoT achieves an average success rate of 93.73% across four diverse manipulation tasks, significantly surpassing prior LLM-based approaches and even exceeding expert-designed rewards. Our results indicate that combining LLMs and VLMs with graph-of-thoughts reasoning provides a scalable and effective solution for autonomous reward evolution in RL.
Published: 2025-09-19T16:35:27Z
Authors: Changwei Yao, Xinzi Liu, Chen Li, Marios Savvides

http://arxiv.org/abs/2603.23672v1
Bio-Inspired Event-Based Visual Servoing for Ground Robots
Updated: 2026-03-24T19:20:20Z
Biological sensory systems are inherently adaptive, filtering out constant stimuli and prioritizing relative changes, likely enhancing computational and metabolic efficiency. Inspired by active sensing behaviors across a wide range of animals, this paper presents a novel event-based visual servoing framework for ground robots. Utilizing a Dynamic Vision Sensor (DVS), we demonstrate that by applying a fixed spatial kernel to the asynchronous event stream generated from structured logarithmic intensity-change patterns, the resulting net event flux analytically isolates specific kinematic states. We establish a generalized theoretical bound for this event rate estimator and show that linear and quadratic spatial profiles isolate the robot's velocity and position-velocity product, respectively. Leveraging these properties, we employ a multi-pattern stimulus to directly synthesize a nonlinear state-feedback term entirely without traditional state estimation. To overcome the inescapable loss of linear observability at equilibrium inherent in event sensing, we propose a bio-inspired active sensing limit-cycle controller.
Experimental validation on a 1/10-scale autonomous ground vehicle confirms the efficacy, extremely low latency, and computational efficiency of the proposed direct-sensing approach.
Published: 2026-03-24T19:20:20Z
Authors: Maral Mordad, Kian Behzad, Debojyoti Biswas, Noah J. Cowan, Milad Siami

http://arxiv.org/abs/2603.23666v1
Quadrature Oscillation System for Coordinated Motion in Crawling Origami Robot
Updated: 2026-03-24T19:10:19Z
Origami-inspired robots offer rapid, accessible design and manufacture with diverse functionalities. In particular, origami robots without conventional electronics have the unique advantage of functioning in extreme environments such as ones with high radiation or large magnetic fields. However, the absence of sophisticated control systems limits these robots to simple autonomous behaviors. In our previous studies, we developed a printable, electronics-free, and self-sustained oscillator that generates simple complementary square-wave signals. This study presents a quadrature oscillation system capable of generating four square-wave signals a quarter-cycle out of phase, enabling four distinct states. Such control signals are important in various engineering and robotics applications, such as orchestrating limb movements in bio-inspired robots. We demonstrate the practicality and value of this oscillation system by designing and constructing an origami crawling robot that utilizes the quadrature oscillator to achieve coordinated locomotion.
Together, the oscillator and robot illustrate the potential for more complex control and functions in origami robotics, paving the way for more electronics-free, rapid-design origami robots with advanced autonomous behaviors.
Published: 2026-03-24T19:10:19Z
Comment: 8 pages, 11 figures, Accepted to ICRA 2026
Authors: Sean Liu, Ankur Mehta, Wenzhong Yan

http://arxiv.org/abs/2603.18336v2
ManiDreams: An Open-Source Library for Robust Object Manipulation via Uncertainty-aware Task-specific Intuitive Physics
Updated: 2026-03-24T18:49:17Z
Dynamics models, whether simulators or learned world models, have long been central to robotic manipulation, but most focus on minimizing prediction error rather than confronting a more fundamental challenge: real-world manipulation is inherently uncertain. We argue that robust manipulation under uncertainty is fundamentally an integration problem: uncertainties must be represented, propagated, and constrained within the planning loop, not merely suppressed during training. We present and open-source ManiDreams, a modular framework for uncertainty-aware manipulation planning over intuitive physics models. It realizes this integration through composable abstractions for distributional state representation, backend-agnostic dynamics prediction, and declarative constraint specification for action optimization. The framework explicitly addresses three sources of uncertainty: perceptual, parametric, and structural. It wraps any base policy with a sample-predict-constrain loop that evaluates candidate actions against distributional outcomes, adding robustness without retraining. Experiments on ManiSkill tasks show that ManiDreams maintains robust performance under various perturbations where the RL baseline degrades significantly. Runnable examples on pushing, picking, catching, and real-world deployment demonstrate flexibility across different policies, optimizers, physics backends, and executors.
The framework is publicly available at https://github.com/Rice-RobotPI-Lab/ManiDreams
Published: 2026-03-18T22:46:46Z
Comment: 9 pages, 10 figures. Project page at https://rice-robotpi-lab.github.io/ManiDreams/
Authors: Gaotian Wang, Kejia Ren, Andrew S. Morgan, Kaiyu Hang

http://arxiv.org/abs/2603.23649v1
Engagement-Zone-Aware Input-Constrained Guidance for Safe Target Interception in Contested Environments
Updated: 2026-03-24T18:49:13Z
We address target interception in contested environments in the presence of multiple defenders whose interception capability is limited by finite ranges. Conventional methods typically impose conservative stand-off constraints based on maximum engagement distance and neglect the interceptors' actuator limitations. Instead, we formulate safety constraints using defender-induced engagement zones (EZs). To account for actuator limits, the vehicle model is augmented with input saturation dynamics. A time-varying safe-set tightening parameter is introduced to compensate for transient constraint violations induced by actuator dynamics. To ensure scalable safety enforcement in multi-defender scenarios, a smooth aggregate safety function is constructed using a log-sum-exp operator combining individual threat measures associated with each defender's capability. A smooth switching guidance strategy is then developed to coordinate interception and safety objectives. The attacker pursues the target when sufficiently distant from threat boundaries and progressively activates evasive motion as the EZ boundaries are approached. The resulting controller relies only on relative measurements and does not require knowledge of defender control inputs, thus facilitating a fully distributed and scalable implementation. Rigorous analysis provides sufficient conditions guaranteeing target interception, practical safety with respect to all defender engagement zones, and satisfaction of actuator bounds.
An input-constrained guidance law based on conservative stand-off distance is also developed to quantify the conservatism of maximum-range-based safety formulations. Simulations with stationary and maneuvering defenders demonstrate that the proposed formulation yields shorter interception paths and reduced interception time compared with conventional methods while maintaining safety throughout the engagement.
Published: 2026-03-24T18:49:13Z
Authors: Praveen Kumar Ranjan, Abhinav Sinha, Yongcan Cao

http://arxiv.org/abs/2603.23607v1
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
Updated: 2026-03-24T18:00:08Z
In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions, and detailed reasoning traces, facilitating in-context learning and few-shot generalization. The resulting benchmark for multimodal models, such as VLMs and VLAs, goes beyond safety and comfort metrics by evaluating instruction following and semantic coherence between model outputs. The multilingual reasoning traces in English, Spanish, and Chinese are from domain experts with diverse cultural backgrounds. Thus, our dataset is a unique resource for studying how different forms of reasoning affect driving competence. Our dataset is available at: https://hf.co/datasets/kit-mrt/kitscenes-longtail
Published: 2026-03-24T18:00:08Z
Comment: 21 pages
Authors: Royden Wagner, Omer Sahin Tas, Jaime Villa, Felix Hauser, Yinzhe Shen, Marlon Steiner, Dominik Strutz, Carlos Fernandez, Christian Kinzig, Guillermo S.
Guitierrez-Cabello, Hendrik Königshof, Fabian Immel, Richard Schwarzkopf, Nils Alexander Rack, Kevin Rösch, Kaiwen Wang, Jan-Hendrik Pauls, Martin Lauer, Igor Gilitschenski, Holger Caesar, Christoph Stiller

http://arxiv.org/abs/2603.04714v2
Design, Mapping, and Contact Anticipation with 3D-printed Whole-Body Tactile and Proximity Sensors
Updated: 2026-03-24T17:48:38Z
Robots operating in dynamic and shared environments benefit from anticipating contact before it occurs. We present GenTact-Prox, a fully 3D-printed artificial skin that integrates tactile and proximity sensing for contact detection and anticipation. The artificial skin platform is modular in design, procedurally generated to fit any robot morphology, and can cover the whole body of a robot. The skin achieved detection ranges of up to 18 cm during evaluation. To characterize how robots perceive nearby space through this skin, we introduce a data-driven framework for mapping the Perisensory Space, the body-centric volume of space around the robot where sensors provide actionable information for contact anticipation. We demonstrate this approach on a Franka Research 3 robot equipped with five GenTact-Prox units, enabling online object-aware operation and contact prediction.
Published: 2026-03-05T01:29:30Z
Comment: This work was accepted at the International Conference on Robotics and Automation (ICRA) 2026
Authors: Carson Kohlbrenner, Anna Soukhovei, Caleb Escobedo, Nataliya Nechyporenko, Alessandro Roncone

http://arxiv.org/abs/2603.23481v1
VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs
Updated: 2026-03-24T17:45:06Z
Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone.
In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer fine-tuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.
Published: 2026-03-24T17:45:06Z
Comment: https://plan-lab.github.io/projects/vtam/
Authors: Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou

http://arxiv.org/abs/2512.11773v4
ProbeMDE: Uncertainty-Guided Active Proprioception for Monocular Depth Estimation in Surgical Robotics
Updated: 2026-03-24T17:13:24Z
Monocular depth estimation (MDE) provides a useful tool for robotic perception, but its predictions are often uncertain and inaccurate in challenging environments such as surgical scenes where textureless surfaces, specular reflections, and occlusions are common.
To address this, we propose ProbeMDE, a cost-aware active sensing framework that combines RGB images with sparse proprioceptive measurements for MDE. Our approach utilizes an ensemble of MDE models to predict dense depth maps conditioned on both RGB images and on a sparse set of known depth measurements obtained via proprioception, where the robot has touched the environment in a known configuration. We quantify predictive uncertainty via the ensemble's variance and measure the gradient of the uncertainty with respect to candidate measurement locations. To prevent mode collapse while selecting maximally informative locations to propriocept (touch), we leverage Stein Variational Gradient Descent (SVGD) over this gradient map. We validate our method in both simulated and physical experiments on central airway obstruction surgical phantoms. Our results demonstrate that our approach outperforms baseline methods across standard depth estimation metrics, achieving higher accuracy while minimizing the number of required proprioceptive measurements.
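The ensemble-variance step mentioned in the ProbeMDE abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ensemble_variance` and `most_uncertain_pixel` are hypothetical, depth maps are plain nested lists, and a simple argmax stands in for the SVGD-based selection the abstract describes.

```python
from statistics import pvariance


def ensemble_variance(depth_maps):
    """Per-pixel population variance across an ensemble of depth maps.

    depth_maps: list of K maps with identical shape, each a list of rows
    of floats. Returns one map of the same shape.
    (Hypothetical helper for illustration only.)
    """
    rows = len(depth_maps[0])
    cols = len(depth_maps[0][0])
    return [
        [pvariance([m[r][c] for m in depth_maps]) for c in range(cols)]
        for r in range(rows)
    ]


def most_uncertain_pixel(variance_map):
    """Return (row, col) of the largest variance; the first one wins on ties.

    Assumption: a plain argmax replaces the SVGD step from the abstract.
    """
    best_value = None
    best_index = None
    for r, row in enumerate(variance_map):
        for c, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value = value
                best_index = (r, c)
    return best_index
```

Calling `most_uncertain_pixel(ensemble_variance(maps))` on a list of same-shaped maps returns the pixel where the ensemble members disagree most, which is one plausible candidate location for the next touch measurement.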
Project page: https://brittonjordan.github.io/probe_mde/
Published: 2025-12-12T18:36:53Z
Comment: 8 pages, 5 figures. Accepted at ICRA 2026. Project page: https://brittonjordan.github.io/probe_mde/
Authors: Britton Jordan, Jordan Thompson, Jesse F. d'Almeida, Hao Li, Nithesh Kumar, Susheela Sharma Stern, James Ferguson, Ipek Oguz, Robert J. Webster, Daniel Brown, Alan Kuntz

http://arxiv.org/abs/2603.17808v2
EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards
Updated: 2026-03-24T17:12:47Z
Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions.
Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.
Published: 2026-03-18T15:02:19Z
Comment: Project page: https://eva-project-page.github.io/
Authors: Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia
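
The EVA abstract above describes a reward that favors smooth motion (velocity, acceleration, jerk) and penalizes actions outside embodiment limits. The sketch below shows one way such a score could be computed for a single joint using finite differences. It is an illustration under stated assumptions, not the paper's reward: the function names, the weights `w_vel`, `w_acc`, `w_jerk`, `w_bound`, and the scalar joint limits `lower` and `upper` are all hypothetical.

```python
def finite_diff(seq, dt):
    """First-order finite difference of a sequence sampled every dt seconds."""
    return [(b - a) / dt for a, b in zip(seq, seq[1:])]


def mean_square(values):
    """Mean of squared values; 0.0 for an empty sequence."""
    return sum(v * v for v in values) / len(values) if values else 0.0


def smoothness_reward(positions, dt, lower, upper,
                      w_vel=1.0, w_acc=1.0, w_jerk=1.0, w_bound=10.0):
    """Negative cost: smoother, in-bounds trajectories score higher.

    positions: decoded joint positions for one joint (list of floats).
    All weights and limits are illustrative assumptions.
    """
    vel = finite_diff(positions, dt)
    acc = finite_diff(vel, dt)
    jerk = finite_diff(acc, dt)
    # Total distance by which the trajectory leaves [lower, upper].
    violation = sum(max(lower - p, 0.0) + max(p - upper, 0.0)
                    for p in positions)
    cost = (w_vel * mean_square(vel)
            + w_acc * mean_square(acc)
            + w_jerk * mean_square(jerk)
            + w_bound * violation)
    return -cost
```

With these definitions, a steady ramp scores higher than a trajectory that oscillates between two positions, and any sample outside the limits lowers the score in proportion to how far it strays.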