http://arxiv.org/abs/2603.05230v2 Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems 2026-06-12T13:56:54Z

The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.

2026-03-05T14:42:19Z 10 pages, single column, 5 figures, preprint for Photomet Edumet 2026 (Klagenfurt, Austria) Serkan Ergun Tobias Mitterer Hubert Zangl http://arxiv.org/abs/2606.14433v1 Kine2Go: Kinematic dataset for the Unitree Go2 robot with diverse gaits and motions 2026-06-12T13:13:53Z

The recent popularity of robotics, combined with the steadily decreasing cost of robotic hardware, has lowered the entry barrier to robotics research and enabled rapid advancements in the field. One of the primary examples is the Unitree Go2 quadruped robot, which is often used by researchers in areas such as locomotion, navigation, and control. Many researchers use the Go2 robot in combination with techniques like imitation learning, reinforcement learning, and behavioral cloning to allow machine learning systems to take full control of the robot. At the same time, many of those techniques require demonstration data consisting of the robot's kinematic information and the actions applied to the motors. Obtaining such data is difficult, requires building complex pipelines, and can take significant time. To aid those efforts, we present Kine2Go, a dataset of 800 diverse gait kinematic trajectories for the Unitree Go2 robot, derived from 40 distinct policies. Our pipeline accepts data from various quadruped morphologies and translates them to a Go2-compatible format. We then use reinforcement learning to train policies that follow a given motion, and finally we gather data from those policies, which yields robust, perturbed kinematic data with corresponding motor-level actions.

2026-06-12T13:13:53Z 9 pages, 6 figures Władysław Pałucki Paweł Siwak Krzysztof Ciebiera Marek Cygan http://arxiv.org/abs/2606.14421v1 ForestBack: Breadcrumb-Based Pedestrian Dead Reckoning for Infrastructure-Free Return Navigation 2026-06-12T12:57:57Z

Reliable return navigation remains an important challenge in GPS-denied environments where external positioning infrastructure may be unavailable or unreliable. This paper presents ForestBack, an infrastructure-free pedestrian return navigation framework based on breadcrumb-based pedestrian dead reckoning (PDR). The system records a user's walking route as a sequence of reversible breadcrumb nodes and generates reverse-path guidance without requiring GPS, Wi-Fi, Bluetooth beacons, or pre-installed infrastructure. ForestBack integrates acceleration-based step detection, adaptive step-length estimation, magnetometer-assisted heading estimation, barometric-altitude correction, and bidirectional breadcrumb path reconstruction. The system was evaluated using an indoor obstacle-avoidance route with five checkpoints, where the user navigated around a central obstacle. A dataset of 36 walking trials and 42,474 time-series samples was used for evaluation, including IMU signals, magnetometer readings, barometric variables, turn-event labels, ground-truth trajectories, baseline PDR outputs, proposed ForestBack outputs, and power-related measurements. Experimental results show that ForestBack reduced the mean RMSE from 1.129 m to 0.965 m compared with traditional PDR, corresponding to a 15.76% improvement. The mean final-position error was reduced from 1.781 m to 1.388 m, while turn-event detection consistency reached approximately 99.90%. These results indicate that ForestBack improves trajectory reconstruction and route-preserving return guidance in obstacle-avoidance scenarios. The released dataset and analysis notebook support reproducibility and future benchmarking of infrastructure-free PDR-based return navigation systems.
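The breadcrumb idea described above can be illustrated with a minimal sketch. It assumes each detected step is already available as a (step length, heading) pair and ignores the barometric correction and adaptive step-length estimation; the names `Breadcrumb`, `record_route`, and `return_guidance` are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Breadcrumb:
    x: float            # east position in metres (dead-reckoned)
    y: float            # north position in metres (dead-reckoned)
    heading_rad: float  # heading of the step that arrived at this node


def record_route(steps, origin=(0.0, 0.0)):
    """Dead-reckon a route from (step_length_m, heading_rad) pairs."""
    x, y = origin
    crumbs = [Breadcrumb(x, y, 0.0)]
    for length, heading in steps:
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        crumbs.append(Breadcrumb(x, y, heading))
    return crumbs


def return_guidance(crumbs):
    """Walk the breadcrumbs backwards and return (distance_m, heading_rad) legs."""
    legs = []
    for cur, prev in zip(reversed(crumbs), reversed(crumbs[:-1])):
        dx, dy = prev.x - cur.x, prev.y - cur.y
        legs.append((math.hypot(dx, dy), math.atan2(dy, dx)))
    return legs
```

Calling `return_guidance(record_route(steps))` yields one leg per recorded step, in reverse order, so following the legs retraces the outbound route instead of cutting straight back through an obstacle.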

2026-06-12T12:57:57Z 9 pages, 6 figures, 1 table, and 19 equations Aueaphum Aueawatthanaphisut Chanakan Chaipan http://arxiv.org/abs/2606.14418v1 Causal Object-Centric Models for Planning with Monte Carlo Tree Search 2026-06-12T12:55:25Z

We introduce COMET (Causal Object-centric Model for Efficient Tree search), a model-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot-structured latent space. COMET pairs a frozen unsupervised object-centric encoder with a transformer-based world model, in which actions are bound to objects through a novel action-slot fusion mechanism that is used in slot transition prediction. Policy and value heads use object-causal attention, modulating token interactions by learned per-slot relevance scores so that decision-making concentrates on task-relevant entities. COMET adds an explicit object-level inductive bias to MuZero-style latent planning. Across eight visually and dynamically diverse tasks from the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object-centric and monolithic baselines.

2026-06-12T12:55:25Z Rodion Vakhitov Leonid Ugadiarov Alexey Skrynnik Aleksandr Panov http://arxiv.org/abs/2606.14409v1 Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack 2026-06-12T12:45:18Z

In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment. Each component serves a distinct role in this stack.

2026-06-12T12:45:18Z He Zhang Lingzhu Xiang Haitao Lin Zeyu Huang Minghui Wang Dingyan Zhong Yubo Dong Yihao Wu Yongming Rao Dongsheng Zhang Wanjia He Ling Chen Kai Huang Jiahao Chen Sichang Su Xumin Yu Ziyi Wang Chengwei Zhu Xiao Teng Yuchun Guo Yufeng Zhang Yuandong Liu Rui Wang Zisheng Lu Han Hu Zhengyou Zhang http://arxiv.org/abs/2606.04718v3 CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation 2026-06-12T12:08:10Z

Humans primarily rely on walking and running to traverse complex terrains. Similarly, humanoid robots should be able to smoothly transition between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference between tasks and the distribution shift caused by terrain variations. Although Mixture-of-Experts (MoE) architectures can mitigate multi-skill interference, direct joint training often fails to achieve clear expert specialization. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced, and the gating network is trained with a contrastive objective to learn structured terrain representations and promote expert specialization. The final action is obtained through weighted fusion of the base gait policy and the terrain-aware branch, enabling the policy to preserve stable locomotion while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains while maintaining accurate foothold control and dynamic stability.
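The weighted fusion of the base gait policy and the terrain-aware branch might look roughly like the sketch below. The abstract does not give the exact fusion rule, so the convex combination, the softmax gate, and the names `fuse_actions` and `branch_weight` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_actions(base_action, expert_actions, gate_logits, branch_weight=0.5):
    """Blend a base action with a gated mixture of expert actions.

    Assumed form: (1 - w) * base + w * sum_i gate_i * expert_i.
    """
    gates = softmax(gate_logits)
    branch = [
        sum(g * action[j] for g, action in zip(gates, expert_actions))
        for j in range(len(base_action))
    ]
    return [
        (1.0 - branch_weight) * b + branch_weight * r
        for b, r in zip(base_action, branch)
    ]
```

With `branch_weight=0.0` the output reduces to the base gait action, which is one simple way to keep stable locomotion as the fallback behavior.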

2026-06-03T10:51:46Z Kailun Huang, Zikang Xie, Yanzhe Xie and Panpan Liao contributed equally to this work. Corresponding authors: Renjing Xu, Haohui Huang and Chenguang Yang Kailun Huang Hong Kong University of Science and Technology Zikang Xie Hong Kong University of Science and Technology Yanzhe Xie Hong Kong University of Science and Technology Panpan Liao Guangdong University of Technology Fanghai Zhang Hong Kong University of Science and Technology Yanheng Mai Hong Kong University of Science and Technology Wenhao Xu South China Agricultural University Yunheng Wang Hong Kong University of Science and Technology Renjing Xu Hong Kong University of Science and Technology Haohui Huang Guangdong University of Technology Chenguang Yang The Hong Kong Polytechnic University http://arxiv.org/abs/2606.14375v1 Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models 2026-06-12T12:06:41Z

Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control: contact-rich or uncertain states may need more computation and fresher feedback, while easier states can often be handled with fewer inference steps and longer open-loop execution. We propose Elastic Queries Reinforcement Learning (EQRL), a framework that makes each VLA policy query elastic. A lightweight latent-schedule adaptor jointly selects the latent input, denoising budget, and action chunk length, without fine-tuning the underlying VLA model. To make scheduling difficulty-aware, EQRL trains a critic over the joint latent-schedule action and derives a state difficulty signal from critic ensemble disagreement. This signal guides compute toward difficult states, while a learned residual allows task-driven correction. We formulate variable chunk execution as query-level macro-action RL with chunk-dependent discounting and an amortized number-of-function-evaluations (NFE) budget. Across simulation and real-robot manipulation, EQRL reduces amortized inference cost while preserving or improving task success.
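Two of the ingredients named above can be sketched in a few lines. The sketch assumes that ensemble disagreement is measured as the standard deviation of the critics' value estimates and that chunk-dependent discounting follows the usual semi-MDP form, where a chunk of k primitive steps is discounted by gamma to the power k; neither detail is confirmed by the abstract, and the function names are hypothetical.

```python
import statistics


def difficulty_from_ensemble(q_values):
    """State difficulty proxy: population std of the ensemble's Q estimates."""
    return statistics.pstdev(q_values)


def macro_action_target(rewards, next_value, gamma=0.99):
    """Bootstrapped target for one query that executes len(rewards) steps.

    target = sum_{i<k} gamma^i * r_i + gamma^k * V(s'), with k = len(rewards).
    """
    k = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted + (gamma ** k) * next_value
```

Under these assumptions, a longer chunk bootstraps from a more heavily discounted next-state value, so the critic can compare queries that execute different numbers of actions on a common scale.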

2026-06-12T12:06:41Z Ge Wang Xinyu Tan Xiang Li Man Luo Chengsi Yao Shenhao Yan Jiahao Yang Fan Feng Honghao Cai Xiangyuan Wang Zhixin Mai Yiming Zhao Yatong Han Zhen Li http://arxiv.org/abs/2512.21201v3 Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation 2026-06-12T10:14:19Z

Zero-shot object navigation (ZSON) requires robots to find target objects in unseen environments without task-specific fine-tuning or pre-built maps, a key capability for general-purpose service robots. Yet methods that perform well in simulation often degrade in cluttered real-world scenes with severe occlusion and latent hazards, where large unseen regions make single-scene inference brittle and unsafe. We propose Schrödinger's Navigator, a belief-aware framework that reasons at inference time over multiple trajectory-conditioned imagined 3D futures. Given candidate paths, a trajectory-conditioned 3D world model predicts hypothetical observations and maintains a superposition of plausible scene realizations rather than committing to one map. An adaptive occluder-aware sampler directs imagination to uncertainty-critical regions, while a Future-Aware Value Map (FAVM) aggregates imagined futures for robust, proactive action selection. Experiments in simulation and on a physical Go2 quadruped show that Schrödinger's Navigator outperforms strong ZSON baselines, improving hidden-target discovery and risk-aware waypoint selection in occlusion-heavy navigation scenarios. These results highlight imagined 3D futures as a scalable and generalizable strategy for zero-shot navigation in uncertain real-world environments.

2025-12-24T14:28:17Z Yu He Da Huang Zhenyang Liu Zixiao Gu Qiang Sun Guangnan Ye Yanwei Fu Yu-Gang Jiang http://arxiv.org/abs/2503.14331v4 ADAPT: An Autonomous Forklift for Construction Site Operation 2026-06-12T09:22:40Z

Efficient material logistics play a critical role in controlling costs and schedules in the construction industry. However, manual material handling remains prone to inefficiencies, delays, and safety risks. Autonomous forklifts offer a promising solution to streamline on-site logistics, reducing reliance on human operators and mitigating labor shortages. This paper presents the development and evaluation of ADAPT (Autonomous Dynamic All-terrain Pallet Transporter), a fully autonomous off-road forklift designed for construction environments. Unlike structured warehouse settings, construction sites pose significant challenges, including dynamic obstacles, unstructured terrain, and varying weather conditions. To address these challenges, our system integrates AI-driven perception techniques with traditional approaches for decision making, planning, and control, enabling reliable operation in complex environments. We validate the system through extensive real-world testing, comparing its continuous performance against an experienced human operator across various weather conditions. Our findings demonstrate that autonomous outdoor forklifts can operate near human-level performance, offering a viable path toward safer and more efficient construction logistics.

2025-03-18T15:03:28Z Johannes Huemer Markus Murschitz Matthias Schörghuber Lukas Reisinger Thomas Kadiofsky Christoph Weidinger Mario Niedermeyer Benedikt Widy Marcel Zeilinger Csaba Beleznai Tobias Glück Andreas Kugi Patrik Zips http://arxiv.org/abs/2606.14270v1 Robust Fall Recovery for Armless Bipedal-Wheeled Robots Via Force-Guided Learning 2026-06-12T08:51:51Z

Fall recovery is critical for autonomous legged locomotion. Existing methods have demonstrated that some legged robots, such as humanoids and quadrupeds, are capable of fall recovery from diverse postures by using arms or coordinating multiple legs to generate support forces. Without arms or other legs to provide supportive assistance, a bipedal-wheeled robot must rely solely on the actuation of its legs, making recovery particularly difficult. To address this, we introduce FTSR (Force-guided Teacher-student framework with Stage-wise Rewards). The force-guided method constructs an external auxiliary force during simulation training that correlates directly with the robot's real-time height, explicitly formulating this force as an optimizable constraint. Through constrained reinforcement learning, the policy is guided toward gradually reducing force dependency and increasing the body height, developing internal recovery strategies despite having no arms for support. Height-progressive stage-wise rewards structure posture stabilization during recovery and the transition to sustained locomotion, and are integrated with a teacher-student architecture that distills privileged knowledge of force effects and recovery dynamics. After simulation training, the policy is deployed on a physical armless bipedal-wheeled robot and extensively evaluated. Experiments confirm robust and reliable fall recovery under diverse challenging conditions, demonstrating strong environmental adaptability and motion robustness, while maintaining full post-recovery motion capability. The framework also generalizes effectively to a high-DOF humanoid, confirming its practical generalizability. The project page is available at https://2350575870.github.io/force-guided.github.io/

2026-06-12T08:51:51Z 8 pages, 6 figures, accepted by IEEE Robotics and Automation Letters (RA-L) IEEE Robotics and Automation Letters, 2026 Haidong Hou Zhangguo Yu Tao Han Hengbo Qi Khaleel Ghazal Yu Zhang Yidong Du Xuechao Chen Fei Meng 10.1109/LRA.2026.3701481 http://arxiv.org/abs/2606.14267v1 FloVerse: Floor Plan-Guided Multi-Modal Navigation 2026-06-12T08:49:53Z

Floor plans encapsulate compact spatial priors, enabling agents to navigate unseen scenes more efficiently. While prior work has explored floor plan-guided navigation, it has focused mainly on PointNav and a limited set of environments. To bridge this gap, we introduce FloVerse, a new task for floor plan-guided embodied navigation that unifies PointNav, ObjectNav, and ImageNav. To support FloVerse, we assemble FloVerse-1.6K, a large-scale dataset of 1.6K scenes from HM3D and Gibson 4+, paired with corresponding floor plans, comprising 240K expert trajectories and 12M RGBD frames. We further propose ThreeDiff, a two-stage imitation learning policy comprising a planner, a diffusion-based multimodal goal-reasoning module trained via masked-modality modeling, and a refiner, a depth-based trajectory-refinement module for safe execution. Extensive experiments demonstrate that (1) floor-plan priors improve navigation performance across all goal modalities, and (2) ThreeDiff implicitly captures spatial information from floor plans. These results underscore the effectiveness of spatial priors and validate our proposed unified approach for floor plan-guided embodied navigation.

2026-06-12T08:49:53Z Accepted at CVPR 2026 Weiqi Huang Shuangyi Dong Jiaxin Li Yifei Guo Zan Wang Wei Liang http://arxiv.org/abs/2603.03733v2 X-Loco: Towards Generalist Humanoid Locomotion Control via Synergetic Policy Distillation 2026-06-12T08:49:21Z

While recent advances have demonstrated strong performance in individual humanoid skills such as upright locomotion, fall recovery and whole-body coordination, learning a single policy that masters all these skills remains challenging due to the diverse dynamics and conflicting control objectives involved. To address this, we introduce X-Loco, a framework for training a vision-based generalist humanoid locomotion policy. X-Loco trains multiple oracle specialist policies and adopts a synergetic policy distillation with a case-adaptive specialist selection mechanism, which dynamically leverages multiple specialist policies to guide a vision-based student policy. This design enables the student to acquire a broad spectrum of locomotion skills, ranging from fall recovery to terrain traversal and whole-body coordination skills. To the best of our knowledge, X-Loco is the first framework to demonstrate vision-based humanoid locomotion that jointly integrates upright locomotion, whole-body coordination and fall recovery, while operating solely under velocity commands without relying on reference motions. Experimental results show that X-Loco achieves superior performance, demonstrated by tasks such as fall recovery and terrain traversal. Ablation studies further highlight that our framework effectively leverages specialist expertise and enhances learning efficiency.

2026-03-04T05:07:05Z Accepted by RSS 2026. Project page: https://x-loco-humanoid.github.io/ Dewei Wang Xinmiao Wang Chenyun Zhang Jiyuan Shi Yingnan Zhao Chenjia Bai Xuelong Li http://arxiv.org/abs/2606.12728v2 EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows 2026-06-12T08:41:03Z

Most learned dexterous grasp generators relegate contact forces to a downstream verification step, so a kinematically-plausible pose can still violate the conditions for a stable physical grasp. We address this with EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from an object point cloud. Our architecture projects contacts onto the object surface and forces into the Coulomb friction cone by construction, so placement and friction compliance hold without loss penalties. We prove end-to-end SE(3) equivariance and verify it empirically over 200 rotations, with wrist residuals below $0.04^\circ$ and exactly zero joint deviation. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, our model achieves zero friction violations, the best composite score, and the lowest wrench residual among all ablation variants. We retarget decoded fingertip contacts to a 16-DoF LEAP Hand via per-finger inverse kinematics, and our hardware-feasible refinement places every joint at least 5% inside its actuator envelope while preserving wrench balance. On the physical robot, retargeted EquiDexFlow-decoded grasps complete open-loop pick-and-hold trials on all six test objects, with every asymmetric object succeeding at both the canonical pose and a $120^\circ$ co-rotation. Videos, code, and checkpoints are available at https://equidexflow.github.io.
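To make the friction-cone condition concrete, the sketch below maps an arbitrary contact force into the Coulomb cone, where the tangential magnitude is at most mu times the non-negative normal component. It uses a simple clamp-and-rescale rule, which is an assumption for illustration and not necessarily the projection used in EquiDexFlow; `clamp_to_friction_cone` is a hypothetical name.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def clamp_to_friction_cone(force, normal, mu):
    """Return a force satisfying |f_t| <= mu * f_n and f_n >= 0.

    `normal` points into the object along the contact normal. The normal
    component is clamped at zero and the tangential part is rescaled if
    needed; this is not the Euclidean projection onto the cone.
    """
    n_len = math.sqrt(dot(normal, normal))
    n = tuple(c / n_len for c in normal)
    raw_normal = dot(force, n)
    f_n = max(raw_normal, 0.0)
    tangential = tuple(f - raw_normal * c for f, c in zip(force, n))
    t_len = math.sqrt(dot(tangential, tangential))
    limit = mu * f_n
    scale = 1.0 if t_len <= limit else limit / t_len
    return tuple(f_n * c + scale * t for c, t in zip(n, tangential))
```

A force that already lies inside the cone is returned unchanged, while a force pulling away from the surface collapses to zero, so every output satisfies the friction constraint without a separate penalty term.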

2026-06-10T22:27:03Z 22 pages, 11 figures, 11 tables. Project page with videos, code, and checkpoints: https://equidexflow.github.io Clinton Enwerem John S. Baras Calin Belta http://arxiv.org/abs/2606.14255v1 ReactVLA: Fast and Lightweight Reactive Robot Manipulation via Improved Mean Flow Action Generation 2026-06-12T08:33:37Z

Diffusion-based Vision-Language-Action (VLA) policies have demonstrated strong capability in modeling expressive and multimodal action distributions. However, their reliance on iterative sampling introduces substantial inference latency, which limits their applicability to reactive closed-loop robot manipulation. To address this limitation, we propose ReactVLA, a lightweight and low-latency VLA framework for real-time robotic manipulation. ReactVLA combines two complementary designs: (1) an improved Mean Flow (iMF) action generator that reduces expensive multi-step diffusion sampling to one-to-few-step action generation, and (2) Attention Residuals (AttnRes), a dynamic depth-wise feature routing mechanism that replaces uniform residual accumulation to better preserve task-relevant multimodal representations. We evaluate ReactVLA on large-scale simulation benchmarks, including LIBERO and RoboIMI, as well as real-world robotic manipulation tasks. Experimental results show that ReactVLA consistently outperforms similarly sized VLA baselines, including SmolVLA and $\pi_0$. On challenging precision manipulation tasks, ReactVLA achieves up to a 1.65$\times$ improvement in task performance while providing more than a 4$\times$ increase in inference speed compared with leading VLA models. Finally, it reduces real-world policy latency to below 38.6 ms, enabling fast reactive control on physical robot platforms. Please check out our project website at: https://game-loader.github.io/ReactVLA/.

2026-06-12T08:33:37Z Yanzhao Guo Wenkai Chen Jianwei Zhang http://arxiv.org/abs/2606.14252v1 Optimality-Preserving Decomposition for Scalable QAOA in Natural-Language-Guided Multi-Drone Assignment 2026-06-12T08:31:14Z

As multi-drone fleets scale, zone assignment rapidly evolves into an intractable NP-hard combinatorial problem that overwhelms classical exhaustive search. While quantum optimization promises to shatter these classical bottlenecks, mapping complex spatial tasks from human intent to restricted quantum hardware remains a severe challenge. To bridge this gap, we present an end-to-end framework integrating a fine-tuned Large Language Model (LLM) front-end with a highly scalable, domain-specific quantum-classical backend. The front-end utilizes Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to translate free-form natural language instructions into structurally robust Quadratic Unconstrained Binary Optimization (QUBO) constraints without false negatives. To overcome the strict qubit limits of near-term quantum devices, our framework features a novel constraint-preserving graph partitioner and a compressed separator-based dynamic programming (DP) merge. By structurally encoding constraints via W-state initialization and XY-mixers in Conditional Value-at-Risk Quantum Approximate Optimization (CVaR-QAOA), the pipeline stays highly compact. Empirical results demonstrate that this architecture circumvents classical scaling walls, recovering the global optimum on 100% of idealized oracle cases and 96.3% under real QAOA sampling, enabling natural-language-guided task allocation at previously intractable scales.
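As a small illustration of how an assignment constraint becomes a QUBO, the sketch below encodes "each drone gets exactly one zone" as a quadratic penalty and solves a toy instance by exhaustive search. The one-hot encoding, the cost matrix, and the function names are assumptions for illustration; the paper's partitioner, DP merge, and CVaR-QAOA solver are not reproduced here, and this toy version does not forbid two drones from sharing a zone.

```python
from itertools import product


def build_qubo(cost, penalty):
    """Build an upper-triangular QUBO {(i, j): weight} for one-hot assignment.

    cost[d][z] is the cost of sending drone d to zone z. For binary x,
    penalty * (sum_z x_dz - 1)^2 expands to -penalty on each diagonal term
    and +2 * penalty on each pair within a drone (constant offset dropped).
    """
    n_zones = len(cost[0])
    qubo = {}
    for d, row in enumerate(cost):
        for z, c in enumerate(row):
            i = d * n_zones + z
            qubo[(i, i)] = c - penalty
            for z2 in range(z + 1, n_zones):
                qubo[(i, d * n_zones + z2)] = 2 * penalty
    return qubo


def energy(qubo, x):
    """Evaluate the QUBO objective for a binary assignment vector x."""
    return sum(w * x[i] * x[j] for (i, j), w in qubo.items())


def brute_force(qubo, n_vars):
    """Exhaustively find the minimum-energy bitstring (toy sizes only)."""
    return min(product((0, 1), repeat=n_vars), key=lambda x: energy(qubo, x))
```

For two drones and two zones, `brute_force(build_qubo(cost, penalty), 4)` returns a 4-bit tuple in which bit `d * 2 + z` is set when drone d is assigned to zone z. The exhaustive search grows as 2^n, which is the scaling wall that decomposition and quantum sampling are meant to avoid.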

2026-06-12T08:31:14Z 10 pages, 2 figures, 3 tables, preprint Junyeop Bang Byongho Lee Dohyun An Hwangnam Kim