https://arxiv.org/api/piHIrRCE6cOZwMXOuKwmUogp34o
2026-06-22T19:28:18Z

http://arxiv.org/abs/2606.15171v1
Seam-to-Graph Reconstruction for Garment Configuration Alignment
2026-06-13T07:38:18Z
Seams encode rich structural information about garments but are frequently partially observable in robotic manipulation scenarios. To robustly leverage seam information, we propose a Seam-to-Graph network based on graph neural networks and attention mechanisms. This network maps unstructured seam observations to a topology-encoded structural skeleton graph for real-time garment state estimation. Using this skeleton-graph-based state estimation, we design a deformation-aware, hierarchical visual servoing controller for garment configuration alignment. We implement this controller on a bimanual robot system to load a garment onto a screen printing platen and to align it to the desired configuration precisely. Real-robot experiments demonstrate that the robot using the proposed method not only achieves human-level alignment accuracy with reduced variance in alignment error but is also robust to different garments. These results demonstrate that the use of seam information is effective for garment manipulation.
2026-06-13T07:38:18Z
11 pages, 9 figures
Xuzhao Huang, Kai Tang, Fuyuki Tokuda, Norman C. Tien, Kazuhiro Kosuge

http://arxiv.org/abs/2606.15165v1
VLALeaks: Membership Inference Attacks against Vision-Language-Action Models
2026-06-13T07:29:33Z
Vision-Language-Action (VLA) models enable end-to-end robot control and have garnered widespread attention. However, the memorization of training data inherent to VLA, coupled with the high cost of robotic data acquisition, raises serious concerns regarding data privacy leakage and intellectual property infringement. Membership inference attacks (MIAs) aim to determine whether a given sample belongs to the training set. While representing a significant privacy threat, this attack remains underexplored in the context of VLA models.
To bridge this gap, we propose VLALeaks, which is based on attention discrepancies in VLA models. We reveal, for the first time, the privacy vulnerabilities of VLA models. Specifically, it comprises a two-stage process: (1) membership feature extraction, and (2) attack model construction. Experimental results across multiple VLA benchmarks demonstrate that VLALeaks readily reveals membership information and achieves optimal attack AUC and TPR@1%FPR, highlighting the privacy vulnerabilities in current VLA model deployments. Our work is the first systematic study of MIAs on VLA models, aiming to provide insights for secure and trustworthy VLA models.
2026-06-13T07:29:33Z
Security and Privacy
Xukun Luan, Jinyan Liu, Xuesong Li, Yuanguo Bi, Renjun Wu, Zhongxiang Lei, Di Wang

http://arxiv.org/abs/2606.15154v1
Task-Aware Environment Augmentation for Reliable Navigation via Shielded Conditional Diffusion
2026-06-13T06:49:36Z
Reliable trajectory planning under partial observability depends not only on computing a feasible geometric path, but also on whether the robot receives informative observations while executing that trajectory. Existing approaches usually keep the environment fixed and adapt the robot through belief-space planning, active localization, or added sensing, often incurring costly uncertainty propagation and brittle behavior in observation-poor regions. We flip this perspective and address the largely open problem of \emph{task-aware environment augmentation}: given a mapped environment, a planned task trajectory, and a small budget of visual fiducial markers, where should the environment be augmented so that the planned trajectory can be executed reliably under uncertainty?
Our key observation is that useful marker layouts are defined by the localization support they provide along the task trajectory: a small number of well-timed observations can be sufficient to prevent uncertainty from accumulating in regions where state-estimation error would otherwise compromise control. Building on this observation, we present SCoDA, $\textbf{S}$hielded $\textbf{Co}$nditional $\textbf{D}$iffusion for Environment $\textbf{A}$ugmentation. SCoDA learns a conditional distribution over high-performing fiducial layouts from data, using the environment, planned trajectory, disturbance context, and desired execution profile as conditioning. Its shielded sampler reasons over where along the planned execution pose corrections should occur, and steers this distribution toward task-relevant, finite-budget augmentations. Across simulated benchmarks and hardware deployments, we show that SCoDA improves trajectory execution reliability and completion time over strong baselines.
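The key observation above (a few well-timed pose corrections can keep uncertainty from accumulating) can be illustrated with a toy scalar model. The sketch below is not SCoDA and makes no claim about its sampler; it only assumes a 1D position variance that grows by a fixed amount per step under dead reckoning and shrinks through a scalar Kalman update whenever a marker is observed. The function name `peak_variance` and all numeric values are hypothetical.

```python
def peak_variance(n_steps, process_noise, marker_steps, marker_noise):
    """Return the largest position variance reached along a trajectory.

    Assumed toy model (not from the paper): variance grows by
    `process_noise` at every step, and a marker observation at a step in
    `marker_steps` applies the scalar Kalman update P <- P*R / (P + R).
    """
    variance = 0.0
    peak = 0.0
    for step in range(n_steps):
        variance += process_noise          # dead reckoning: uncertainty grows
        peak = max(peak, variance)         # record the value before any correction
        if step in marker_steps:
            # Marker observed: fuse a measurement with variance `marker_noise`.
            variance = variance * marker_noise / (variance + marker_noise)
    return peak


if __name__ == "__main__":
    no_markers = peak_variance(100, 0.01, set(), 0.01)
    three_markers = peak_variance(100, 0.01, {24, 49, 74}, 0.01)
    print(f"peak variance without markers:    {no_markers:.3f}")
    print(f"peak variance with three markers: {three_markers:.3f}")
```

In this toy setting, three evenly spaced corrections cut the peak variance to roughly a quarter of the uncorrected value, which is the intuition behind placing a small marker budget where it supports the planned trajectory.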
Code, models and dataset available at: https://scoda-diffusion.github.io/
2026-06-13T06:49:36Z
Bharawee Phoompho, Gokul Puthumanaillam, Yan Miao, Ruben Hernandez, Tim Bretl, Sayan Mitra, Melkior Ornik

http://arxiv.org/abs/2606.15142v1
MotionVLA: Vision-Language-Action Model for Humanoid Motion
2026-06-13T06:10:48Z
Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA.
Website: https://aigeeksgroup.github.io/MotionVLA.
2026-06-13T06:10:48Z
Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang

http://arxiv.org/abs/2606.15139v1
Self-Driving Negotiator: An interactive, verifiable benchmark for social negotiation and theory of mind under hidden intent
2026-06-13T05:58:48Z
Autonomous driving is full of tiny social negotiations: a driver presses forward, another yields, a pedestrian fakes toward the curb, or a lane vehicle chooses whether to open a merge gap. Such interactions require inferring hidden intent from behavior under partial observability and then acting safely and efficiently. Existing autonomous-driving language benchmarks mostly focus on perception, visual question answering, or open-loop planning, while existing language-agent negotiation benchmarks typically make the negotiation explicit in text. Self-Driving Negotiator bridges the gap between the two: a text-only, multi-turn, procedurally generated environment for measuring implicit social coordination in driving. Agents generate specific driving actions. Reward and diagnostics are computed from the privileged simulator state, not from the explanation of the model. This report covers task design, reward and anti-gaming invariants, validated scenarios, non-LLM baselines, and a six-model inference leaderboard. Current models are far removed from the scripted expert. The best average success rate across three scenarios is 0.68; contested merge is statistically flat across models; and difficulty tiers separate cue-following from true wait-for-commitment behavior.
2026-06-13T05:58:48Z
Ashutosh Kumar

http://arxiv.org/abs/2606.12978v2
Trajectory-Level Redirection Attacks on Vision-Language-Action Models
2026-06-13T05:51:59Z
Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions.
The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $\textit{appears}$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $\textit{command-preserving trajectory redirection}$, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/
2026-06-11T07:12:17Z
Gokul Puthumanaillam, Vardhan Dongre, Pranay Thangeda, Hooshang Nayyeri, Dilek Hakkani-Tür, Melkior Ornik

http://arxiv.org/abs/2606.15133v1
DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects
2026-06-13T05:47:14Z
Dexterous interaction with articulated objects is important for household, assistive, and humanoid manipulation, where multi-finger hands can provide compliant contact patterns beyond parallel-jaw grasping.
However, articulated-object manipulation differs from static-object manipulation: the target part cannot be directly actuated, and its motion must emerge through sustained physical hand--handle contact. This makes the transition from object-centric articulated generation to hand-driven dexterous hand--object interaction non-trivial, since geometric trajectory replay or open-loop execution does not model the contact dynamics required to move the articulated part. Moreover, policies trained only for task completion under fixed dynamics can overfit nominal contact loads, especially without tactile or force feedback, and may degrade when the contact load changes. To address these challenges, we present DragMesh-2, a contact-driven framework for dexterous interaction with articulated objects that extends articulated interaction from object-centric generation to hand-driven dexterous hand--object interaction, where articulated motion must arise through physical contact. We further propose PICA, a physically informed contact-aware training mechanism that injects physical signals into policy learning without tactile or force feedback, improving robustness and task success under changing contact loads. Finally, we conduct systematic evaluation across multiple damping conditions and articulated-object categories to study robustness under contact-load variation, and provide a pure-geometry dexterous interaction resource to support future loco-manipulation and humanoid hand--object interaction research. Across seven GAPartNet objects, DragMesh-2 achieves stronger robustness under contact-load variation than the compared methods while maintaining high task success across damping conditions.
2026-06-13T05:47:14Z
Code: https://github.com/AIGeeksGroup/DragMesh-2.
Website: https://aigeeksgroup.github.io/DragMesh-2
Tianshan Zhang, Yijia Duan, Yanjun Li, Zeyu Zhang, Hao Tang

http://arxiv.org/abs/2602.00222v3
MapDream: Task-Driven Map Learning for Vision-Language Navigation
2026-06-13T05:39:42Z
Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.
2026-01-30T17:33:16Z
Guoxin Lian, Shuo Wang, Yucheng Wang, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Deying Li, Zhaoxin Fan

http://arxiv.org/abs/2605.22183v3
Action with Visual Primitives
2026-06-13T04:46:06Z
Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective.
As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end-to-end architecture that implements this visual-primitive-centric interface: the VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert, with supervision derived from end-effector kinematics. Real-robot experiments on general pick-and-place tasks show that AVP improves the success rate by 37.04% over pi_0.5 and outperforms other recent methods, with consistent gains in data efficiency, spatial-compositional generalization, and object-level transfer.
2026-05-21T08:52:47Z
9 pages, 6 figures. Project page: https://kingdroper.github.io/AVP/
Weilong Guo, Yuchen Wang, Renping Zhou, Yunfeng Zhang, Rui Fang, Yuyang Pang, Wenda Xu, Gao Huang

http://arxiv.org/abs/2606.15099v1
Think Less, Act Early: Reinforced Latent Reasoning with Early Exit in Vision-Language-Action Models
2026-06-13T04:16:18Z
Existing Vision-Language-Action (VLA) models predominantly rely on explicit Chain-of-Thought (CoT) reasoning to bridge perception and action. While effective, this paradigm suffers from high computational costs and error propagation in multi-step tasks. In this paper, we propose Adaptive Variable Alignment VLA (AVA-VLA), a novel Latent Reasoning VLA framework that models reasoning as a sequence of unobservable latent variables, bypassing the need for explicit text generation. However, latent trajectories are inherently susceptible to noise interference and misalignment with downstream objectives. To address this, we introduce a Reinforcement Learning-based Denoising mechanism that treats latent state generation as a sequential decision process, optimizing reasoning trajectories via task-level rewards.
Furthermore, we incorporate an Early-Exit Strategy that adaptively terminates reasoning based on state confidence, enabling a dynamic trade-off between depth and efficiency. Extensive experiments on embodied decision benchmarks demonstrate that AVA-VLA achieves a 6x inference speedup over explicit CoT methods while attaining a 98.3% average success rate on LIBERO, improving both efficiency and long-horizon stability over full-reasoning baselines.
2026-06-13T04:16:18Z
Accepted at ICML 2026
Dianqiao Lei, Lianlei Shan

http://arxiv.org/abs/2606.15068v1
Design and Fabrication of a Spin Coater with In-Situ Optical Measurement for Soft Thin Films
2026-06-13T02:55:27Z
Spin coating is widely used for fabrication of thin polymer and elastomer films, yet reliable thickness verification of highly compliant materials remains challenging due to deformation from contact-based measurements and the cost and complexity of conventional optical metrology. Accurate thickness control is especially critical in soft elastomer applications such as dielectric elastomer actuators (DEAs), where mechanical and functional performance scales strongly with film thickness. This work presents a low-cost, primarily 3D-printed benchtop spin coater with an integrated, minimally deforming optical thickness measurement system for soft-film fabrication workflows. The system is designed to manufacture films between 50 and 300 microns thick with repeatability within 10 microns. Thickness is measured in-situ by tracking displacement of a reflected laser beam via quadrant photodetector, avoiding significant deformation. Optical geometry, sensor linearity constraints, and structural validation via finite element analysis are discussed. Experimental validation using calibrated metal shims demonstrated a thickness resolution of 3.6-3.7 microns and best-case measurement repeatability of 13 microns (95 percent confidence interval).
The platform repeatably produced silicone films within 9 microns of target thickness, demonstrating that accessible optical metrology can be integrated into a low-cost spin coating system for practical, thickness-controlled fabrication of compliant thin films without specialized industrial instrumentation.
2026-06-13T02:55:27Z
8 pages, 7 figures, 5 tables. To be published in the conference proceedings for AIM 2026
Daniel Gliksberg, Jiajie Qiu, Jun Suzuki, Kamal Youcef-Toumi

http://arxiv.org/abs/2606.15064v1
Phase-Localized Curation Does Not Help: A Negative Result on Per-Phase Metric Selection for Demonstration Filtering
2026-06-13T02:45:10Z
Manipulation demonstrations have temporal phase structure, and a natural hypothesis is that demonstration-curation metrics should be applied within phases rather than globally. The idea is to segment each trajectory into phases, score each phase with the metric that is locally most informative, and then aggregate. This follows directly from prior work showing that a single global metric can be the best detector of a defect and yet the worst curator of the resulting policy. We test the per-phase hypothesis on three contact-rich LIBERO pick-and-place tasks with a controlled early-release structural defect, comparing phase-gated curation against the same metrics applied uniformly and against a strong single global metric. Across all three tasks and five random seeds per condition, phase-gated curation is never the best curation strategy, and it is the worst of the three on two of the three tasks (Task 1: 86.0 vs. 92.0 for global; Task 3: 22.7 vs. 48.0 for uniform). We trace the failure to a concrete mechanism. When the defect signal is concentrated in a single phase, rank-aggregating across phases dilutes that signal with uninformative scores from defect-free phases, selecting a worse demonstration subset than simply applying the defect-informative metric everywhere.
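The dilution mechanism described above can be reproduced in a small synthetic experiment. The sketch below is an illustration under stated assumptions, not the paper's pipeline: demonstrations, phases, and scores are simulated, only phase 0 carries a defect signal, and "rank aggregation" is taken to mean summing per-phase ranks. The names `rank_scores` and `simulate_curation` and all parameter values are hypothetical.

```python
import random


def rank_scores(scores):
    """Map each score to its 0-based rank (a larger score gets a larger rank)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def simulate_curation(n_demos=100, n_defective=20, n_phases=4, seed=0):
    """Count defective demos flagged by one informative metric vs. summed ranks.

    Assumed setup: demos 0..n_defective-1 are defective, phase 0 scores them
    about 3 standard deviations higher, and every other phase is pure noise.
    """
    rng = random.Random(seed)
    defective = set(range(n_defective))
    phase_scores = []
    for phase in range(n_phases):
        shift = 3.0 if phase == 0 else 0.0   # only phase 0 is defect-informative
        phase_scores.append([
            (shift if i in defective else 0.0) + rng.gauss(0.0, 1.0)
            for i in range(n_demos)
        ])

    single_metric = rank_scores(phase_scores[0])
    per_phase_ranks = [rank_scores(s) for s in phase_scores]
    aggregated = [sum(r[i] for r in per_phase_ranks) for i in range(n_demos)]

    def defects_flagged(score):
        # Flag the n_defective highest-scoring demos for removal.
        flagged = sorted(range(n_demos), key=lambda i: score[i], reverse=True)
        return len(defective & set(flagged[:n_defective]))

    return defects_flagged(single_metric), defects_flagged(aggregated)


if __name__ == "__main__":
    single, aggregated = simulate_curation()
    print(f"defects flagged by the single informative metric: {single}")
    print(f"defects flagged after rank aggregation:           {aggregated}")
```

Averaged over seeds, the aggregated score flags fewer of the defective demonstrations because three noise phases contribute as much rank variance as the one informative phase contributes signal.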
We further show that the per-phase metric selection does not transfer across tasks, since no phase shares a winning metric between any two tasks, so the selection cannot be reused and must be re-derived per task from a noisy sweep. These results bound a plausible and previously untested method, and they argue that practitioners should prefer identifying a single defect-informative metric over decomposing curation by phase. We release the full pipeline, all metric implementations, and per-seed results.
2026-06-13T02:45:10Z
5 pages, 3 tables. Code: https://github.com/aaravbedi/phase-gated-curation
Aarav Bedi

http://arxiv.org/abs/2606.15046v1
Exact, Efficient, and Safe Occlusion-Aware Planning Using AH-Polyhedrons
2026-06-13T01:28:07Z
Safely handling occlusions is a fundamental challenge for autonomous mobile robots operating in dynamic environments. This issue is especially prominent in autonomous valet parking (AVP), where traffic rules are lax, occlusions are frequent and cluttered, and overly conservative behavior can leave vehicles stuck. However, existing methods either lack formal safety guarantees, assume agents follow road structures, or introduce conservatism, leaving occlusion-aware planning for AVP an open challenge. In this paper, we propose APRO (AH-Polyhedron Reachability for Occlusions), an exact and efficient occlusion-aware planning framework based on game-theoretic active perception and AH-polyhedron reachability analysis with AVP as our canonical use case. Our key insight is to reformulate set-based safety conditions in prior work as unions of AH-polyhedrons, enabling exact safety verification through linear programming (LP) without any additional conservatism in set computations or assumptions on road topology. We further show how the resulting safety conditions can be integrated into optimization-based planners or a bisection search scheme for real-time applications.
We validate our method in simulation and hardware experiments, including data replay on a real-world parking lot dataset. Experimental results demonstrate that our method consistently achieved a 100% safety rate across all evaluated scenarios while maintaining real-time performance, resulting in safer and more optimal decisions than existing methods with formal safety guarantees.
2026-06-13T01:28:07Z
8 pages, 3 figures
Long Kiu Chung, David Isele, Toktam Mohammadnejad, Faizan M. Tariq, Sangjae Bae, Shreyas Kousik, Jovin D'sa

http://arxiv.org/abs/2606.15028v1
An Autonomous Subgram SMA-Based Swimmer
2026-06-12T23:51:50Z
We present the Swima, a bioinspired 900-mg swimmer propelled by two 10-mg high-work-density (HWD) actuators driven by shape-memory alloy (SMA) wires. We integrated onboard power and computation by using a custom-built printed circuit board (PCB) and an 11-mAh 3.7-V 507-mg single-cell lithium-ion (Li-Ion) battery, which in conjunction enable autonomous swimming in excess of 18 min. The Swima can swim at speeds of up to 22.4 mm/s (0.56 Bl/s), achieves turning rates of up to 14°/s, and can follow 0-degree heading reference trajectories with root mean square (RMS) values of tracking errors of about 6.5° across multiple tests. This robot is the first subgram microswimmer with onboard power, actuation, and computation developed to date.
2026-06-12T23:51:50Z
Under review, 6 pages, 5 figures
Conor K. Trygstad, Francisco M. F. R. Gonçalves, Néstor O. Pérez-Arancibia

http://arxiv.org/abs/2606.15021v1
Steering Autoregressive Vision-Language-Action Policies via Action Token Intervention
2026-06-12T23:37:01Z
We present Token Steering (TS), a method for dynamically steering trajectories generated by an autoregressive vision-language-action (VLA) model through direct intervention in the action-token space.
TS injects low-dimensional user inputs into the model's native action-token representation, allowing users to influence trajectory generation without modifying the underlying vision-language model (VLM) architecture. Because TS operates entirely at inference time, it requires no additional training or finetuning. User inputs guide rather than override the pretrained policy, allowing users to influence robot actions while preserving the dexterity, smoothness, and task priors learned by the VLA. We evaluate TS on two household manipulation tasks -- drawer closing after object placement and state-aware object swapping -- and improve success rates from 10.0% to 72.5% and from 16.7% to 93.8%, respectively. By enabling lightweight, intuitive steering over robot foundation models, our interface has the potential to improve human-robot interaction in consumer environments and broaden accessibility for individuals with limited physical control. Project website: https://jasontchan.github.io/token-steering/
2026-06-12T23:37:01Z
9 pages, 5 figures
Jason Chan, Jonathan C. Kao