Abstract

Human demonstrations as prompts are a powerful way to program robots to do long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. Existing methods either depend on robot-demonstrator paired data, which is infeasible to scale, or rely too heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically aligns robot and demonstrator task executions using optimal transport costs. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent demonstrator videos by retrieving and composing short-horizon demonstrator clips. This approach facilitates effective policy training without the need for paired data. We demonstrate that RHyME outperforms a range of baselines across cross-embodiment datasets, showing a 52% increase in task recall over prior cross-embodiment learning methods.
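
To make the alignment concrete, here is a minimal sketch of an entropic optimal-transport (Sinkhorn) cost between a robot segment and a candidate demonstrator clip, assuming both have already been encoded into per-frame embeddings. The shapes, cosine cost, and regularization value are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def sinkhorn_cost(X, Y, reg=0.1, n_iters=100):
    """Entropic OT cost between robot frame embeddings X (n, d)
    and demonstrator frame embeddings Y (m, d)."""
    # Cosine-distance cost matrix between the two sets of frames.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - Xn @ Yn.T                      # (n, m)

    a = np.full(len(X), 1.0 / len(X))        # uniform marginal over robot frames
    b = np.full(len(Y), 1.0 / len(Y))        # uniform marginal over demo frames
    K = np.exp(-C / reg)                     # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):                 # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # transport plan
    return float(np.sum(P * C))              # alignment cost

Under this sketch, a lower transport cost indicates that a demonstrator clip is a closer semantic match to the robot segment.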

Introduction Figure
We introduce RHyME, a hierarchical framework that trains a robot policy to mimic a long-horizon video from a demonstrator that exhibits mismatched task execution. Our policy translates a demonstrator video into actions to complete the same task on a robot by "imagining" a paired dataset.

Real World Evaluations

Video comparisons of the human demonstration prompt, RHyME, and XSkill on three long-horizon task sequences: (1) turn on light, move pot, drop cloth; (2) move pot, close drawer, drop cloth; (3) turn on light, move pot, close drawer.

We compare our approach to XSkill, a baseline that uses a self-supervised clustering algorithm to group similar visual features and align human and robot video representations. However, this approach can struggle when there are significant differences in how tasks are performed. To overcome this, we instead frame visual imitation as a retrieval problem during training: we match robot videos to the most similar human video segments from an unpaired play dataset, allowing the model to create synthetic demonstrator videos for training.
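
The retrieval step might look roughly like the sketch below. The encoder functions (encode_robot, encode_human), the segment length, and the pluggable clip_cost (for example, the Sinkhorn cost sketched above) are hypothetical placeholders; this illustrates the retrieve-and-compose idea rather than reproducing the authors' code.

import numpy as np

def retrieve_and_compose(robot_video, human_play_clips,
                         encode_robot, encode_human,
                         clip_cost, segment_len=32):
    """Build a synthetic demonstrator prompt for one long-horizon robot demo."""
    # Pre-compute embeddings for every short human play clip.
    human_embs = [encode_human(clip) for clip in human_play_clips]

    synthetic_prompt = []
    # Slide over the robot demo in short, fixed-length segments.
    for start in range(0, len(robot_video), segment_len):
        seg_emb = encode_robot(robot_video[start:start + segment_len])

        # Score every human play clip against this robot segment and keep
        # the closest one in the shared latent space.
        costs = np.array([clip_cost(seg_emb, h) for h in human_embs])
        synthetic_prompt.append(human_play_clips[int(np.argmin(costs))])

    # The concatenated retrievals act as the "imagined" paired human video
    # that conditions the robot policy during training.
    return synthetic_prompt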

Real World Eval
Real-world Results. (Left) Task Embeddings: We use t-SNE to visualize cross-embodiment latent embeddings from the human and the robot completing three tasks. (Right) Task Completion: We compare RHyME with XSkill on seen and unseen long-horizon tasks specified by human prompt videos. Opaque segments indicate the Task Completion rate, and the overlaid transparent bars indicate the Task Attempt rate. RHyME outperforms XSkill on both seen and unseen tasks in the real world.
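
For reference, the left-hand visualization can be reproduced with a short script along these lines, assuming human_embs and robot_embs are (N, d) arrays of latent embeddings with integer task labels from the trained encoders; this is only a plotting sketch using scikit-learn's t-SNE, and the marker and color choices are arbitrary.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_task_embeddings(human_embs, robot_embs, human_labels, robot_labels):
    # Project human and robot embeddings into the same 2-D t-SNE space.
    X = np.concatenate([human_embs, robot_embs], axis=0)
    Z = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)

    n_h = len(human_embs)
    plt.scatter(Z[:n_h, 0], Z[:n_h, 1], c=human_labels, marker="o", label="human")
    plt.scatter(Z[n_h:, 0], Z[n_h:, 1], c=robot_labels, marker="^", label="robot")
    plt.legend()
    plt.title("t-SNE of cross-embodiment task embeddings")
    plt.show()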

Challenging Demonstrators in Simulation

We present results on three simulated datasets. As the demonstrator's actions deviate further, both visually and physically, from those of the robot, policies trained with our framework RHyME consistently outperform XSkill.

Simulation Results: Task Recall
We measure Task Recall, the proportion of successfully completed tasks out of all attempted tasks. As the execution becomes increasingly mismatched, RHyME maintains higher task recall than XSkill.
Simulation Results: Task Imprecision
We also measure Task Imprecision, the percentage of attempted tasks that were not specified by the human prompt. In the hardest scenario, with mismatches in appearance, execution, and manual dexterity, RHyME achieves significantly lower imprecision rates than XSkill.
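
As a reading aid, both metrics reduce to simple ratios over logged rollouts; the sketch below assumes hypothetical per-episode lists of attempted, completed, and prompt-specified tasks.

def task_recall(attempted, completed):
    """Fraction of attempted tasks that were completed successfully."""
    return len(completed) / max(len(attempted), 1)

def task_imprecision(attempted, specified):
    """Fraction of attempted tasks that were NOT in the human prompt."""
    wrong = [t for t in attempted if t not in specified]
    return len(wrong) / max(len(attempted), 1)

# Example: the prompt specifies {light, pot}; the robot attempts light, pot,
# and drawer, but only completes light.
print(task_recall(["light", "pot", "drawer"], ["light"]))              # 0.33
print(task_imprecision(["light", "pot", "drawer"], {"light", "pot"}))  # 0.33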

Paper

BibTeX

@misc{kedia2024oneshotimitationmismatchedexecution,
  title={One-Shot Imitation under Mismatched Execution}, 
  author={Kushal Kedia and Prithwish Dan and Angela Chao and Maximus Adrian Pace and Sanjiban Choudhury},
  year={2024},
  eprint={2409.06615},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2409.06615}, 
}