In collaborative human-robot manipulation, a robot must predict human intents and adapt its actions accordingly to smoothly execute tasks. However, the human's intent in turn depends on the actions the robot takes, creating a chicken-or-egg problem. Prior methods ignore this interdependency and instead train marginal intent prediction models that are independent of robot actions, because training conditional models is hard without paired human-robot interaction datasets. Can we instead leverage large-scale human-human interaction data, which is more easily accessible? Our key insight is to exploit a correspondence between human and robot actions that enables transfer learning from human-human to human-robot data. We propose a novel architecture, InteRACT, that pre-trains a conditional intent prediction model on large human-human datasets and fine-tunes it on a small human-robot dataset. We evaluate on a set of real-world collaborative human-robot manipulation tasks and show that our conditional model improves over various marginal baselines. We also introduce new techniques to tele-operate a 7-DoF robot arm and collect a diverse range of human-robot collaborative manipulation data, which we open-source.
We release a high-quality dataset collected using a motion capture system, consisting of human-human and human-robot episodes of collaboration to perform daily household activities.
Overview of our framework InteRACT, which predicts human intent conditioned on future robot actions for collaborative manipulation tasks. At train time, we first pre-train a conditional intent prediction model on human-human interaction data, combining publicly available datasets with task-specific datasets that we collect. We then fine-tune this model on a small-scale human-robot dataset, where we predict human intent conditioned on robot actions. Our approach has two main features: (1) an alignment loss between human and robot representations to allow transfer between domains, and (2) a new tele-operation technique to control a 7-DoF robot arm for paired human-robot interaction.
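To make the fine-tuning objective concrete, the following is a minimal sketch of how a prediction term and an alignment term might be combined. The caption does not specify the form of either loss, so the use of mean squared error for both terms, the weighting factor `align_weight`, and the function names `mse` and `fine_tune_loss` are all illustrative assumptions rather than the InteRACT implementation.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fine_tune_loss(
    pred_intent: Vector,
    true_intent: Vector,
    human_emb: Vector,
    robot_emb: Vector,
    align_weight: float = 1.0,
) -> float:
    """Hypothetical fine-tuning objective: prediction error plus alignment.

    pred_intent / true_intent: predicted and observed human intent
        (for example, a flattened future pose), conditioned on robot actions.
    human_emb / robot_emb: embeddings of corresponding human and robot
        actions; pulling them together is assumed to be what lets a model
        pre-trained on human-human data transfer to human-robot data.
    """
    prediction_term = mse(pred_intent, true_intent)
    alignment_term = mse(human_emb, robot_emb)  # assumed form of the alignment loss
    return prediction_term + align_weight * alignment_term


# Toy values: prediction term is 2.0, alignment term is 1.0.
loss = fine_tune_loss([1.0, 2.0], [1.0, 4.0], [0.0, 0.0], [1.0, 1.0], align_weight=0.5)
print(loss)
```

Under these assumptions, pre-training on human-human data would use only the prediction term, with the partner human's future motion standing in for the robot action; the alignment term would be added when robot data is introduced.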