https://arxiv.org/api/vaufPuElCMiPYeCiqpi8BxM7HLc 2026-06-22T11:28:51Z 54510 300 15 http://arxiv.org/abs/2606.16542v1 ADAPT: Analytical Disturbance-Aware Policy Training for Humanoid Locomotion 2026-06-15T10:47:18Z Humanoids deployed in human-centered environments must handle force-interactive tasks, where external contacts introduce unexpected disturbances that disrupt locomotion accuracy and stability. Existing learning-based approaches rely on broad domain randomization, task-specific force objectives, or learning-based force estimators from motion history, each of which compromises accuracy, task transferability, or out-of-distribution (OOD) robustness. We present Analytical Disturbance-Aware Policy Training (ADAPT), a framework that equips humanoid policies with a physically grounded disturbance observer. The core of ADAPT is an analytical whole-body disturbance observer that estimates residual force/torque online with the accessible robot dynamics, without requiring force/torque sensors. Fed directly into the policy, the estimated disturbances give the humanoid an explicit, physics-derived sense of external force/torque that can generalize across diverse unseen scenes. Experiments on a Unitree G1 humanoid show that ADAPT achieves accurate disturbance prediction and stronger robustness than a proprioception-only baseline under torso perturbations, standing pushes, and asymmetric hand payloads, with improved velocity tracking even on OOD disturbances. Moreover, ADAPT enables penalizing inferred disturbances at lower-body joints to encourage lighter locomotion. 
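The abstract above does not spell out the observer equations, so the following is only an illustrative sketch of the general idea of estimating an external disturbance from known dynamics without a force/torque sensor. It uses the textbook momentum-based residual observer on a hypothetical 1-DoF point mass, $m\dot{v} = u + d$, not ADAPT's whole-body observer. The class name `MomentumObserver1D`, the helper `simulate`, and all numeric values are assumptions made for illustration.

```python
class MomentumObserver1D:
    """Momentum-based disturbance observer for m * dv/dt = u + d (1-DoF).

    Illustrative stand-in only; not the whole-body observer from the paper.
    """

    def __init__(self, mass, gain, dt, v0=0.0):
        self.mass = mass
        self.gain = gain        # observer gain K [1/s]; stable for 0 < K*dt < 2
        self.dt = dt
        self.p0 = mass * v0     # initial momentum
        self.integral = 0.0     # running integral of (u + residual)
        self.residual = 0.0     # current estimate of the disturbance d

    def update(self, u, v):
        """u: force applied over the last step; v: velocity measured after it."""
        self.integral += (u + self.residual) * self.dt
        self.residual = self.gain * (self.mass * v - self.p0 - self.integral)
        return self.residual


def simulate(d_true=5.0, mass=2.0, gain=20.0, dt=0.002, steps=1000):
    """Run the observer against a point mass pushed by a constant unknown force."""
    obs = MomentumObserver1D(mass, gain, dt)
    v, u = 0.0, 1.0
    estimate = 0.0
    for _ in range(steps):
        v += dt * (u + d_true) / mass   # explicit Euler step of the true dynamics
        estimate = obs.update(u, v)     # observer sees only u and v, never d_true
    return estimate
```

In this discrete form the residual follows $r_k = r_{k-1} + K\,\Delta t\,(d - r_{k-1})$, so for a constant disturbance the estimate behaves like a first-order low-pass filter of $d$ with rate $K$.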
2026-06-15T10:47:18Z Bofan Lyu Jindou Jia Kuangji Zuo Yanshuo Lu Shijia Han Gen Li Boyu Ma Jingliang Li Geng Li Jianfei Yang http://arxiv.org/abs/2606.16515v1 Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning 2026-06-15T10:18:30Z Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor on the raw goal -- a signal that is geometrically uninformative when the goal is far from the data distribution. We propose Direction-Conditioned Policies (DCP), a fully online method that decomposes goal-reaching into two components sharing one InfoNCE representation $ψ$: a subgoal-scoring step that selects a visited state $z_t$ aligned with the final goal $g$ in $ψ_g$, and a direction-conditioned actor that consumes the unit direction $d_t$ and magnitude $r_t$ from $ψ(s_t)$ to $ψ(z_t)$. The two components train jointly, factor cleanly at deployment (subgoal scoring is removed, while direction conditioning remains with $g$ in place of $z_t$), and admit independent modification at the same $(d_t,r_t)$ interface. We prove three results. First, direction sufficiency under HJB: the optimal action under control-affine dynamics depends on the goal only through the value gradient. Second, a quantitative bound showing that, under mild conditions on the learned representation and assuming the scoring rule returns an on-path $z_t$, the actor's conditioning input at training and at deployment coincide up to representation error and geodesic slack. Third, a controllable-subspace characterization of when directional conditioning fails. 
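The $(d_t, r_t)$ interface described above can be sketched as follows, assuming the embeddings $ψ(s_t)$ and $ψ(z_t)$ are available as plain vectors. The function name `direction_and_magnitude`, the `eps` threshold, and the zero-direction fallback for coincident embeddings are illustrative assumptions; the abstract does not say how that degenerate case is handled.

```python
import math


def direction_and_magnitude(psi_s, psi_z, eps=1e-8):
    """Return (d, r): unit direction and magnitude from psi_s to psi_z.

    psi_s, psi_z: equal-length sequences of floats (hypothetical embeddings).
    If the two embeddings coincide (r < eps), a zero direction is returned;
    this fallback is an assumption made for the sketch.
    """
    if len(psi_s) != len(psi_z):
        raise ValueError("embeddings must have the same dimension")
    diff = [z - s for s, z in zip(psi_s, psi_z)]
    r = math.sqrt(sum(c * c for c in diff))
    if r < eps:
        return [0.0] * len(diff), 0.0
    return [c / r for c in diff], r


# During training the target embedding would be psi(z_t); at deployment the
# same call would take psi(g) in its place, as the abstract describes.
d, r = direction_and_magnitude([0.0, 0.0], [3.0, 4.0])
```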
Across nine environments, DCP improves over Contrastive RL on most final metrics, with the largest gains on manipulation and obstacle-interaction tasks; a qualitative analysis of the learned $ψ$-distance landscape shows the contrastive representation behaves as an online quasimetric encoding environment topology, and the single failure case (AntSoccer) localizes to a learned-gradient pathology that the theory anticipates. 2026-06-15T10:18:30Z 17 pages, Accepted to the 2nd Workshop on Compositional Learning at ICML 2026 (Seoul, South Korea) Swaminathan S K Damiya Gondha Theyanesh Eswaramoorthy Rajahkrishnan Aritra Hazra http://arxiv.org/abs/2606.16513v1 Agile Fall Recovery for Quadrotors with Bidirectional Thrust via Reinforcement Learning 2026-06-15T10:15:10Z Autonomous fall recovery is a critical capability for quadrotors operating in real-world environments, where collisions or failures may leave the vehicle resting on the ground in an arbitrary attitude. This problem is challenging because recovery must be achieved under limited onboard sensing, in constrained free space, with ground contact, and in the presence of unknown disturbances. In this letter, we present an RL-based framework for autonomous fall recovery of a quadrotor from arbitrary ground attitudes to stable hover using only lightweight onboard sensors. To address severe partial observability and intermittent sensor invalidity, we train a recurrent policy within an asymmetric actor--critic architecture, leveraging an Incremental Nonlinear Dynamic Inversion (INDI) controller to track the policy output. Combined with high-fidelity simulations of motor response and optical flow, the overall training framework significantly reduces the sim-to-real gap. Simulation ablation studies validate the importance of the main design choices, while real-world experiments demonstrate zero-shot transfer and robust recovery under different initial attitudes, wind disturbances, and additional payloads. 
These results demonstrate that agile quadrotor fall recovery can be achieved without explicit state estimation using only limited and unreliable onboard sensing. 2026-06-15T10:15:10Z Anke Zhao Yuhang Zhong Kenghou Hoi Junyu Mou Junjie Wang Lijie Wang Jialiang Hou Fei Gao http://arxiv.org/abs/2606.16504v1 APEX: Adaptive Policy Execution for Precise Manipulation 2026-06-15T10:06:40Z Modern imitation learning methods, including visuomotor and Vision-Language-Action (VLA) policies, typically output high-level action references that are executed by low-level controllers. However, the absence of higher-order reference signals, together with the policy's lack of awareness of the underlying low-level control dynamics during training, inevitably induces an execution gap. As a result, realized actions deviate systematically from policy-commanded ones, with a critical impact on precision-sensitive manipulation. Prior work either modifies the policy architecture or the low-level controller, both requiring intrusive changes to the pretrained policy or packaged controller. This raises a natural question: when the policy and controller are both treated as inaccessible black boxes, can we bridge the execution gap? We propose Adaptive Policy Execution (APEX), a plug-and-play framework inserted between the policy and the controller that reconstructs a dynamically feasible reference from policy outputs and adapts at test-time according to low-level state feedback, with a provable convergence guarantee. Extensive empirical studies show that APEX reduces controller-induced tracking error by 41.2% on demonstration replay and improves manipulation success by 4.8--25.8 percentage points across four visuomotor and VLA policy classes. 
2026-06-15T10:06:40Z 20 pages, 9 figures, 4 tables Mengfei Zhao Chenxi Jiang Tuo An Jindou Jia Jianfei Yang http://arxiv.org/abs/2606.16491v1 HATS: A Human-Agent Teleoperation System for Multi-Arm Data Collection 2026-06-15T09:56:24Z Many real-world manipulation scenarios, such as handling complex collaborative tasks and dealing with large workspaces, require coordination of more than two robotic arms. Consequently, an effective multi-arm teleoperation system is required to collect demonstrations for training coordinated multi-arm manipulation policies. However, existing teleoperation frameworks mainly focus on single-operator or multi-operator setups, facing a practical trade-off between the cognitive load placed on a single operator and the coordination cost incurred by multiple operators. To address this problem, we introduce HATS, a human-agent teleoperation system that enables a single human operator, assisted by an MLLM-based agent, to collect data for multi-arm manipulation tasks. Our system decouples the control space: two primary arms are directly teleoperated by the human, while two assistive arms are controlled by a training-free agent that handles sub-tasks. In addition, the human operator can use voice commands to prevent collisions and correct assistive arm behaviors during execution. Extensive evaluations demonstrate that HATS achieves data collection efficiency and success rates comparable to expert dual-human teams. Moreover, downstream policy evaluations demonstrate the efficacy and quality of the data collected through HATS. 2026-06-15T09:56:24Z Zesen Lin Jian-Jian Jiang Haoming Cen Xiao-Ming Wu Dandan Zhang Wei-Shi Zheng http://arxiv.org/abs/2606.16490v1 Robots that Collaborate: Sequential Asymmetric Imitation for Learning Coupled Robot Policies 2026-06-15T09:55:58Z Collaborative mobile manipulation requires robots to coordinate with a partially observed partner while physically interacting through shared objects. 
This is difficult because failures often arise not from poor local skills, but from mistimed waiting, yielding, pulling, releasing, or repositioning. We study this problem with two bimanual mobile manipulators coupled through rigid and deformable objects. We propose Sequential Asymmetric Imitation (SAI), a single-teleoperator curriculum for learning coupled multi-robot behaviors without synchronized dual-operator demonstrations or explicit inter-robot communication. SAI trains Robot A from unilateral demonstrations with a compliant human partner, trains Robot B against the deployed Robot A policy, and then refines Robot A using sparse interventions near coordination failures. This staged process exposes the policies to increasingly realistic partner behaviors, including delay, phase mismatch, insufficient yielding, and interaction conflict. Across real-world dual-robot manipulation tasks, SAI improves task success, phase synchronization, and partner-contingent yielding over independent imitation and curriculum-ablation baselines. These results suggest that physically coupled collaboration can be learned through the structure of the imitation curriculum, rather than through synchronized multi-operator demonstrations or explicit coordination mechanisms. Project page: http://cyc0429.github.io/sai-project-page/ 2026-06-15T09:55:58Z Yincong Chen Ranpeng Qiu Zihao Li Yanan Zhou Guoqiang Ren Weiming Zhi http://arxiv.org/abs/2606.16480v1 HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization 2026-06-15T09:49:14Z Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. 
Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control. 2026-06-15T09:49:14Z Youngjae Min Jovin D'sa Faizan M. Tariq David Isele Navid Azizan Sangjae Bae http://arxiv.org/abs/2606.16474v1 MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry 2026-06-15T09:39:28Z Monocular visual odometry (MVO) is foundational to autonomous navigation and robotic localization. However, existing learning-based MVO approaches often struggle with either a lack of interpretable, complementary features or overly complex multi-stage architectures. These limitations inherently restrict their robustness and cross-domain generalization. In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. 
Our architecture features a Flow-Semantic Dual Branch Encoder that synergizes dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, enabling coarse-to-fine pose refinement while dynamically suppressing attention on unreliable regions. Extensive evaluations demonstrate that, without any target-domain fine-tuning, MVOFormer achieves superior zero-shot generalization and robustness, significantly outperforming prior learning-based frame-to-frame methods across diverse benchmarks including TartanAir, KITTI, TUM-RGBD, and ETH3D-SLAM. 2026-06-15T09:39:28Z 8 pages, 6 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L) Jituo Li Shunwang Sun Jialu Zhang Xinqi Liu Jinyao Hu Zhicheng Lu Sajad Saeedi Guodong Lu http://arxiv.org/abs/2606.16470v1 Decoupled Object-Centric Video Understanding for Generating Robotic Manipulation Commands 2026-06-15T09:36:44Z Translating video demonstrations into executable robot commands remains challenging because existing methods often fail to identify which objects are functionally involved in the demonstrated action. As a result, they may generate commands that are linguistically plausible but operationally ambiguous. We propose an object-centric video understanding framework that decouples action recognition from object identification to generate precise, grammar-free manipulation commands. Our approach integrates Temporal Shift Modules (TSM) for efficient spatio-temporal action classification with a novel Object Selection algorithm that identifies task-relevant objects through trajectory-based role classification, blur detection, and overlap minimization. The selected objects are then processed by Vision-Language Models (VLMs) for robust category recognition and zero-shot generalization. 
Evaluated on a modified Something-Something V2 dataset, our method achieves 86.79% action classification accuracy and BLEU-4 scores of 0.337 on standard objects and 0.261 on novel objects. These results improve over the strongest task-specific baseline by 80.2% and 143.9%, respectively. Larger gains are observed in METEOR and CIDEr, reaching 157.9% and 171.7% on novel objects. Across all semantic metrics, our approach consistently outperforms task-specific methods and remains competitive with, or surpasses, large general-purpose VLMs while retaining a modular, object-centric design. 2026-06-15T09:36:44Z Thanh Nguyen Canh Thanh-Tuan Tran Haolan Zhang Ziyan Gao Xiem HoangVan Nak Young Chong http://arxiv.org/abs/2606.16467v1 A Formal Resilience Framework for Cyber-Physical Embodied Systems under Device-Level Cyberattacks 2026-06-15T09:35:04Z In cyber-physical systems (CPSs), fault tolerance is traditionally achieved by analysing sensor and actuator outputs, detecting progressive drift or sudden failures, and initiating suitable tolerance mechanisms. Reasonable under general failure models, this approach fails to capture nuanced disruptions caused by cyberattacks, which may employ subtle strategies. This is particularly critical in embodied CPSs, where computational and physical devices not only have an active role in task completion, but also in embodiment preservation (that is, maintaining the system's physical integrity). To prevent structural physical damage, embodied CPSs require a framework that enables proactive response to cyberattacks. This paper proposes a formal dependability framework that incorporates IDS information into resilience evaluation predicates, enabling assessment of tolerance to disruption and degradation. The framework supports structured reasoning about how cyberattacks affect task execution and embodiment preservation, and whether mitigation strategies must be deployed. 
Analytical examples demonstrate its analytical capability and soundness, establishing a theoretical foundation for dependable and secure embodied CPSs. 2026-06-15T09:35:04Z 8 pages, 2 tables Alberto Giaretta http://arxiv.org/abs/2604.00768v2 An Ergonomic, Customizable Soft Robotic Glove toward Personalized Hand Rehabilitation 2026-06-15T09:33:55Z Hand impairment following neurological disorders substantially limits independence in activities of daily living, motivating the development of effective assistive and rehabilitation strategies. Soft robotic gloves have attracted growing interest in this context, yet persistent challenges in customization, ergonomic fit, and user comfort constrain their clinical utility. Here, we present an ergonomic, customizable fabric-based soft robotic glove whose actuators can be tailored to individual finger-joint geometry. The glove comprises five dual-action actuators supporting finger flexion and extension, together with a dedicated thumb abduction actuator. Leveraging computer numerical control heat sealing technology, we fabricated symmetrical-chamber actuators that adopt a concave outer surface upon inflation, thereby increasing finger contact area and improving comfort. Characterization confirmed joint moment and grasping force sufficient for ADL-relevant tasks. In ten healthy subjects, active assistance significantly reduced forearm muscle activity during manipulation, and a pilot study in three individuals with cervical spinal cord injury showed more natural grasp patterns and reduced reliance on tenodesis grasp. 
2026-04-01T11:32:09Z Rui Chen Firman Isma Serdana Domenico Chiaradia Xianlong Mai Elena Losanno Gabriele Righi Claudia De Santis Federica Serra Vincent Mendez Cristian Camardella Daniele Leonardis Giulio Del Popolo Silvestro Micera Antonio Frisoli http://arxiv.org/abs/2606.16458v1 RHO: Your Coding Agent is Secretly a Roboticist 2026-06-15T09:30:12Z Code-as-Policies (CaP) has shown that large language models (LLMs) can write code to solve robotics tasks by composing perception, planning, and control primitives. Recent CaP systems, however, rely on multi-turn code-generation loops at test time, which is often infeasible for real-time robot control. We introduce Robotics Harness Optimization (RHO), a novel paradigm in which tool-enabled coding agents, at training time, propose and search for interpretable, neurosymbolic multi-file policy repositories (Repositories-as-Policies) that compose these primitives rather than a single prompt, function, or file. RHO searches with reflective feedback from environment reward and execution rather than teleoperation demonstrations. It generalizes to perturbed pick-and-place settings like LIBERO-PRO, where OpenVLA scores 0.0% and $π_{0.5}$ averages 12.83%. Using the same low-level primitives, RHO reaches a 45.0% success rate, 2.5x higher than the strongest multi-turn agentic system, and 3.5x higher than $π_{0.5}$. On Robosuite, RHO sets a new state-of-the-art of 70.0%, exceeding the prior multi-turn record of 68.29% using single-turn execution with no corrective LLM code edits at deployment. When an LLM is used in the control loop, as on RAI's O3DE benchmark, RHO optimizes the deployed agent's multi-file harness of prompts, tools, and control code, improving held-out success from 23.5% to 44.3% with 20% less wall-clock time and 27% fewer tool calls. 2026-06-15T09:30:12Z 46 pages, 9 figures, 15 tables. Project page: https://rho-robotics.github.io Karim Elmaaroufi Justin Svegliato Sarunas Kalade Graham Schelle Sanjit A. 
Seshia Matei Zaharia http://arxiv.org/abs/2606.16447v1 Training and Evaluating Diffusion Policies with Long Context Lengths 2026-06-15T09:19:34Z Imitation learning has enabled highly-dexterous robotic manipulation from RGB observations. Policies trained with these methods, however, typically condition robot actions on only a short history of observations. These policies cannot solve tasks that require memory and can get stuck repeatedly executing the same failing motions. In this work, we first benchmark policy performance as context length is incrementally increased from short to long, across a spectrum of tasks with varying local stability and memory requirements, and in multiple data regimes. To our knowledge, this is the first study to investigate context length in imitation learning at this level of detail. Our results challenge prior claims: naively scaling context length is not as brittle as advertised in literature. With an appropriate conditioning method and denoising backbone (UNet+Cross-Attention), single-task policies achieve high success rates on many tasks in the usual data regime even with naive scaling. Next, we propose a training algorithm to jointly train policies at multiple context lengths, further reducing the sample complexity of long-context learning. Finally, we apply our findings to re-evaluate some previously proposed solutions to long-context imitation learning. 2026-06-15T09:19:34Z Abhinav Agarwal Adam Wei Taylan Kargin Michael Zeng Cole Becker Arif Kerem Dayi Pablo Parrilo Asuman Ozdaglar Russ Tedrake http://arxiv.org/abs/2606.14238v2 When and How Severely: Scenario-Specific Safety Envelopes for Driving VLAs 2026-06-15T09:17:08Z Safety certification of Vision-Language-Action (VLA) driving planners under ISO 21448 (SOTIF) rests on an Operational Design Domain (ODD) specification that answers two complementary questions: when does the planner start to fail, and how severely does it fail once it does? 
We evaluate Alpamayo R1, a 10B-parameter open-weight driving VLA, on 15,968 (clip, attack) pairs. We find a conservative-aggregate gap: an aggregate safe threshold of $σ\leq 50$ under a 15% average displacement error (ADE) budget masks well-sampled scenarios that tolerate the top of the tested grid ($σ= 70$). A Gaussian Mixture Model (GMM) on the changed-explanation subset identifies six discrete severity bands (BIC-optimal $k{=}6$), so two perturbation conditions with the same mean error can differ materially in their share of high-severity (C4/C5) failures. Joining the two analyses on the same corpus surfaces a finding neither yields in isolation: the scenarios with the loosest noise thresholds are not those with the lowest high-severity rate: STOP_SIGNAL concentrates roughly $4\times$ the C4/C5 share of LANE_KEEPING despite tolerating a larger $σ$. A deployable SOTIF ODD specification for driving VLAs therefore requires a two-dimensional safety envelope, not a single aggregate value per hazard. 2026-06-12T08:20:06Z Abhinaw Priyadershi Jelena Frtunikj http://arxiv.org/abs/2606.16436v1 V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos 2026-06-15T09:08:11Z Achieving autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. As a scalable supplement to costly teleoperation data, extracting trajectories with both visual fidelity and physical plausibility from monocular videos represents a promising frontier in embodied AI. To this end, we introduce V2P-Manip, an efficient framework designed to learn dexterous manipulation policies directly from human demonstration videos. We establish an efficient, integrated pipeline encompassing 3D asset acquisition, trajectory estimation, and dexterous policy learning. To bridge the gap between visual perception and physical constraints, we introduce a two-stage refinement process to enforce spatial alignment and physical consistency. 
Evaluations on the TACO and OakInk benchmarks demonstrate that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Ultimately, experimental results confirm an average success rate of over 75% across multiple synthetic manipulation tasks and validate the adaptability of the extracted manipulation priors across diverse dexterous hand embodiments. 2026-06-15T09:08:11Z Kaihan Chen Yanming Shao Haifeng Ji Xiaokang Yang Yao Mu