Motion Tracks: A Unified Representation for
Human-Robot Transfer in Few-Shot Imitation Learning

Cornell University, Stanford University


Motion Track Policy (MT-π) introduces a unified action space that represents actions as 2D trajectories on an image,
enabling it to imitate directly from cross-embodiment datasets with only a small amount of robot demonstrations.


We evaluate MT-π across a suite of real-world tasks, testing both its robustness in policy execution
and its ability to generalize to scenes and motions present only in human demonstrations.


Abstract

Teaching robots to autonomously complete everyday tasks remains a persistent challenge. Imitation learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the slow, labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for either human hands or robot end-effectors.

We instantiate an IL policy called Motion Track Policy (MT‑π) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT‑π completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering full 6DoF trajectories via multi-view synthesis. MT‑π achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to novel scenarios seen only in human videos.


System Overview


We co-train MT‑π on human and robot demonstrations to predict the future pixel locations of keypoints on the end-effector (shown in red). For robot demonstrations, keypoints are extracted using calibrated camera-to-robot extrinsics, while human hand keypoints are obtained via HaMeR. To address embodiment differences, a Keypoint Retargeting Network maps robot keypoints to more closely resemble the human hand structure. The Motion Prediction Network, based on Diffusion Policy, takes image embeddings and current keypoints as input and predicts future keypoint tracks and grasp states. By operating entirely in image space, MT‑π directly learns actions from both robot and human demonstrations with a cross-embodiment action representation.
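For concreteness, the shape-level sketch below spells out the Motion Prediction Network's inputs and outputs as described above; the embedding dimension is an illustrative placeholder, and the code is a schematic of the interface rather than the actual implementation.

```python
import torch

# Observation for a single camera view: an image embedding plus the current
# pixel locations of the k = 5 end-effector keypoints.
B, k, H, emb_dim = 1, 5, 16, 512            # emb_dim is an illustrative placeholder
image_embedding = torch.randn(B, emb_dim)
current_keypoints = torch.rand(B, k, 2)     # normalized (u, v) coordinates

# The diffusion head is conditioned on the flattened observation ...
conditioning = torch.cat([image_embedding, current_keypoints.flatten(1)], dim=-1)

# ... and denoises an action sample of this shape: H future keypoint
# positions plus a per-step grasp state.
action_sample = torch.randn(B, H, k * 2 + 1)
predicted_tracks = action_sample[..., :-1].reshape(B, H, k, 2)
predicted_grasps = torch.sigmoid(action_sample[..., -1:])
```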


Data Collection

Extracting Motion Tracks

To collect robot demonstrations, we assume access to a workspace with $\geq1$ calibrated camera (with known camera-to-robot extrinsics) and robot proprioceptive states. For each demonstration, we capture a trajectory of images $I_t^{(i)}$ from each available viewpoint. Using the robot’s end-effector position and the calibrated extrinsics, we project the 3D position of the end-effector into the 2D image plane, yielding $k$ keypoints $s_t^{(i)} = \{(u_j^{(i)}, v_j^{(i)})\}_{j=1}^k$. In practice, we take $k = 5$: two points per gripper finger and one at the center. We choose this arrangement because it captures the gripper's configuration well during grasping. The gripper’s open/close state is represented as a binary grasp variable $g_t^{(i)} \in \{0, 1\}$.
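To make the projection step concrete, the minimal sketch below projects the five gripper keypoints into a camera view with a pinhole model; the gripper-frame offsets and the names `T_cam_world` and `K` are illustrative assumptions, not the exact values used in our system.

```python
import numpy as np

def project_ee_keypoints(ee_pos_world, ee_rot_world, T_cam_world, K):
    """Project k = 5 gripper keypoints (defined in the gripper frame) to 2D pixels."""
    # Hypothetical keypoint offsets (meters) in the gripper frame:
    # two per finger plus one at the gripper center.
    offsets = np.array([
        [ 0.04, 0.0, 0.00],   # left finger, tip
        [ 0.04, 0.0, 0.03],   # left finger, base
        [-0.04, 0.0, 0.00],   # right finger, tip
        [-0.04, 0.0, 0.03],   # right finger, base
        [ 0.00, 0.0, 0.05],   # gripper center
    ])
    pts_world = ee_pos_world + offsets @ ee_rot_world.T              # (k, 3) in world frame
    pts_h = np.concatenate([pts_world, np.ones((len(pts_world), 1))], axis=1)
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]                       # world -> camera frame
    uv = (K @ pts_cam.T).T                                           # pinhole projection
    return uv[:, :2] / uv[:, 2:3]                                    # (k, 2) pixel coordinates
```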

Human demonstrations are collected using RGB cameras without needing access to calibrated extrinsics, making it possible to leverage large-scale human video datasets. We use HaMeR, an off-the-shelf hand pose detector, to extract a set of 21 keypoints $s_t^{(i)} = \{(u_j^{(i)}, v_j^{(i)})\}_{j=1}^{21}$. To roughly match the structure of the robot gripper, we select a subset of $k = 5$ keypoints: one on the wrist and two each on the thumb and index finger.
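As an illustration, the snippet below selects such a subset from HaMeR's 21 keypoints; the specific joint indices follow the standard MANO/OpenPose hand ordering and are an assumption, not necessarily the exact indices we use.

```python
import numpy as np

# Indices follow the standard MANO/OpenPose 21-keypoint hand ordering
# (0 = wrist, 1-4 = thumb, 5-8 = index finger); assumed here for illustration.
HAND_SUBSET = [0,      # wrist
               3, 4,   # thumb: IP joint and tip
               7, 8]   # index finger: DIP joint and tip

def select_hand_keypoints(hamer_keypoints_2d):
    """hamer_keypoints_2d: (21, 2) pixel keypoints from HaMeR -> (5, 2) gripper-like subset."""
    return np.asarray(hamer_keypoints_2d)[HAND_SUBSET]
```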

Extracting Grasps from Human Videos

To infer per-timestep grasp actions from human videos, we use a heuristic based on the proximity of hand keypoints to the object(s) being manipulated. For each task, we first obtain a pixel-wise mask of the object using GroundingDINO and SAM 2. Then, if the pixel distance from the object mask to the thumb keypoint, as well as to at least one of the other fingertip keypoints, falls below a threshold, we set $g_t^{(i)} = 1$. By loosely matching the positioning and ordering of keypoints between the human hand and robot gripper, we create an explicit correspondence between human and robot action representations in the image plane.
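A rough sketch of this heuristic is given below; the threshold value and the default keypoint indices are illustrative assumptions.

```python
import numpy as np

def infer_grasp(keypoints, object_mask, thumb_idx=2, fingertip_idxs=(4,), thresh_px=10.0):
    """keypoints: (k, 2) pixel (u, v) coords; object_mask: (H, W) boolean array from SAM 2."""
    mask_pixels = np.argwhere(object_mask)[:, ::-1]   # mask pixel locations as (u, v)
    if len(mask_pixels) == 0:
        return 0

    def dist_to_mask(pt):
        return np.min(np.linalg.norm(mask_pixels - pt, axis=1))

    thumb_close = dist_to_mask(keypoints[thumb_idx]) < thresh_px
    finger_close = any(dist_to_mask(keypoints[i]) < thresh_px for i in fingertip_idxs)
    return int(thumb_close and finger_close)   # g_t = 1 when both are near the object
```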


Inference


MT‑π represents actions as 2D image trajectories, which are not directly executable on a robot. To bridge this gap, we predict motion tracks from two third-person camera views and treat them as pixelwise correspondences. Using stereo triangulation with known extrinsics, we recover 3D keypoints and compute the rigid transformation between consecutive timesteps. This yields a 6DoF trajectory, $a_{t:t+T}$, for robot execution. In practice, we use a much shorter prediction horizon ($H = 16 \ll T$) for more closed-loop reasoning.
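The sketch below illustrates this recovery step, assuming the two views' projection matrices `P1` and `P2` (intrinsics times extrinsics) are known and the predicted tracks are in pixel coordinates; it triangulates with OpenCV and solves for the per-step rigid transform with the Kabsch/SVD method.

```python
import numpy as np
import cv2

def triangulate(P1, P2, uv1, uv2):
    """uv1, uv2: (k, 2) pixel tracks in views 1 and 2 -> (k, 3) points in the world frame."""
    pts_h = cv2.triangulatePoints(P1, P2, uv1.T.astype(float), uv2.T.astype(float))
    return (pts_h[:3] / pts_h[3]).T

def rigid_delta(X_t, X_tp1):
    """Kabsch/SVD: best-fit rotation R and translation t mapping X_t onto X_tp1."""
    c_t, c_tp1 = X_t.mean(axis=0), X_tp1.mean(axis=0)
    H = (X_t - c_t).T @ (X_tp1 - c_tp1)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_tp1 - R @ c_t
    return R, t   # one step of the executed 6DoF trajectory
```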


Evaluations

We evaluate MT‑π on a suite of table-top tasks against two commonly used image-based IL algorithms: Diffusion Policy (DP) and ACT.

  • MT‑π shares the diffusion backbone with DP but differs by training on cross-embodiment data and using an image-based motion-track action space, unlike the 6DoF proprioceptive action space of DP and ACT.
  • Unlike the baselines, MT‑π does not take wrist-camera observations as input, as these are typically absent in human videos.
  • These design choices ensure that differences in policy performance can be attributed to the training data distribution and the action space each policy uses, rather than to other factors.
Method | Human and Robot Data | Wrist Camera Input | 6DoF EE Delta Action Space | Diffusion Backbone
DP     | ✗ | ✓ | ✓ | ✓
ACT    | ✗ | ✓ | ✓ | ✗
MT-π   | ✓ | ✗ | ✗ | ✓

Low Robot-Data Regime

We consider 4 table-top manipulation tasks: folding a towel, placing a fork on a plate, serving an egg, and putting away socks. All algorithms are trained from 25 teleoperated robot demonstrations. MT‑π is provided an additional 10 minutes of human demonstrations.



Generalization to Novel Motions

A benefit of motion tracks as a representation is that they allow positive transfer of motions captured in human demonstrations to an embodied agent. This is enabled by representing human motions explicitly within our action space, rather than only implicitly (e.g., via latent embeddings). To this end, we evaluate two variants of MT‑π (trained on human + robot data vs. robot data only) against DP and ACT on the task of closing a drawer.

Data

During data collection, we only collect demonstrations with the robot closing the drawer to the right. However, human videos include closing the drawer in both directions.

Robot Demos: Only closes the drawer to the right.

Human Demos: Closes the drawer in both directions.

Inference

While all policies successfully close the drawer when it must be closed to the right, only MT-π trained on human + robot data generalizes to closing the drawer to the left, as it directly leverages the action labels in image space from human demonstrations.

MT‑π (H + R): Generalizes human actions to the left.

MT‑π (Robot Only): No action labels going left.


Quantitative Results

Close Direction | DP | ACT | MT-π (Robot Only) | MT-π (H+R)
Right (in $D_{\text{robot}} \cup D_{\text{human}}$) | 20/20 | 17/20 | 20/20 | 20/20
Left (only in $D_{\text{human}}$) | 0/10 | 0/10 | 0/10 | 18/20

How Much Data is Enough?

We evaluate MT‑π on a medium-complexity task to study the policy's performance under varying amounts of both human and robot data.


  • MT‑π without any human demonstrations matches the success rates of DP and ACT given the same amount of robot demonstrations, suggesting that predicting actions in image-space is a scalable action representation even with just robot data.
  • MT‑π matches the performance of the baselines despite using 40% fewer minutes of robot demonstrations, by leveraging ~10 minutes of human demonstrations.
  • Even with a fixed, small amount of teleoperated robot demonstrations, MT‑π obtains noticeably higher policy performance simply by scaling up human video on the order of just a few minutes.

Failure Modes


Missed Grasps

While hand tracking is a fairly reliable part of our pipeline, detecting human grasps remains an open challenge. Consequently, although the general direction of the motion tracks appears reasonable across experiments, the policy sometimes struggles to precisely grasp the desired object. We suspect that our heuristic of leveraging foundation models to infer when hands and objects are in contact introduces some imprecision into the ground-truth human grasp labels, causing our policy to occasionally grasp objects prematurely or imprecisely.


Grasp misses high and track continues to the right.

Gripper hits handle, knocking it to the side.



Track Disagreements

MT-π makes predictions on one image at a time and handles different viewpoints independently; it therefore does not explicitly enforce consistency between tracks across separate views, which can lead to triangulation errors that produce imprecise actions. We demonstrate an extreme example of this with a task that requires the robot to close one drawer on the left and one drawer on the right. Half the demonstrations close the left drawer first, and the other half close the right drawer first. A possible outcome of training on this data is that one viewpoint predicts the robot closing the left drawer first, while the other predicts the robot closing the right drawer first.


Data: 1/2 close left first, 1/2 close right first.

Inference: Disagreement in predictions across views.

In this work, we try to keep teleoperated demonstrations as unimodal as possible to encourage consistency in motion recovery. In the future, we could more explicitly enforce viewpoint consistency via auxiliary projection/deprojection losses.
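As a sketch of what such a consistency signal could look like (this is future work, not part of MT-π), the snippet below triangulates the tracks predicted in two views, reprojects the 3D points back into each view, and measures the disagreement in pixels; a differentiable variant of this quantity could serve as an auxiliary loss. Projection matrices `P1` and `P2` are assumed known, as at inference time.

```python
import numpy as np
import cv2

def crossview_disagreement(P1, P2, tracks_view1, tracks_view2):
    """tracks_view*: (H, k, 2) predicted pixel tracks; returns mean reprojection error (pixels)."""
    errs = []
    for uv1, uv2 in zip(tracks_view1, tracks_view2):                    # per timestep
        pts_h = cv2.triangulatePoints(P1, P2, uv1.T.astype(float), uv2.T.astype(float))
        pts = pts_h[:3] / pts_h[3]                                      # (3, k) triangulated points
        pts_hom = np.vstack([pts, np.ones((1, pts.shape[1]))])
        for P, uv in ((P1, uv1), (P2, uv2)):                            # reproject into each view
            proj = P @ pts_hom
            errs.append(np.linalg.norm((proj[:2] / proj[2]).T - uv, axis=1).mean())
    return float(np.mean(errs))
```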


Common Failures of Baselines

We found that the DP and ACT baselines generally move in the correct direction but often fail to grasp the object or complete the task. We suspect this is due to the low coverage of possible states provided by the small number (~25 trajectories) of robot demonstrations in these experiments. In experiments where we scale up the number of robot demonstrations, or when the task is simple enough that the reset distribution is fully covered, DP and ACT achieve much higher success rates.


ACT: Misses fork, right general direction.

DP: Cloth lifted too low, but right general direction.