To collect robot demonstrations, we assume access to a workspace
with at least one calibrated camera (with known camera-to-robot
extrinsics) and to the robot’s proprioceptive states. For each
demonstration, we capture a trajectory of images $I_t^{(i)}$
from each available viewpoint. Using the robot’s end-effector
position and the calibrated extrinsics, we project the 3D
position of the end-effector into the 2D image plane, yielding
$k$ keypoints $s_t^{(i)} = \{(u_j^{(i)}, v_j^{(i)})\}_{j=1}^k$.
In practice, we take $k = 5$, giving us two points per finger on
the gripper, and one in the center. We choose this positioning
of points as it lends itself better to gripper positioning for
grasping actions. The gripper’s open/close state is represented
as a binary grasp variable $g_t^{(i)} \in \{0, 1\}$.
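To make the projection step concrete, the sketch below maps gripper keypoints, expressed as fixed offsets in the end-effector frame, into the image plane with a standard pinhole model. This is a minimal sketch assuming NumPy and homogeneous $4 \times 4$ transforms; the names `T_robot_ee`, `T_cam_robot`, `K`, and `finger_offsets` are illustrative placeholders, not the exact implementation.

```python
import numpy as np

def project_gripper_keypoints(T_robot_ee, T_cam_robot, K, finger_offsets):
    """Project k gripper keypoints into the image plane.

    T_robot_ee:     (4, 4) end-effector pose in the robot base frame
                    (from proprioception).
    T_cam_robot:    (4, 4) calibrated extrinsics mapping robot-frame
                    points into the camera frame.
    K:              (3, 3) camera intrinsics.
    finger_offsets: (k, 3) keypoint offsets in the end-effector frame,
                    e.g. two per finger and one at the gripper center
                    (offset values are an assumption for illustration).
    Returns a (k, 2) array of (u, v) pixel coordinates.
    """
    # Homogenize the offsets defined in the end-effector frame.
    offsets_h = np.concatenate(
        [finger_offsets, np.ones((len(finger_offsets), 1))], axis=1
    )                                                   # (k, 4)
    # End-effector frame -> robot base frame -> camera frame.
    pts_cam = (T_cam_robot @ T_robot_ee @ offsets_h.T).T[:, :3]  # (k, 3)
    # Pinhole projection followed by the perspective divide.
    uvw = (K @ pts_cam.T).T                              # (k, 3)
    return uvw[:, :2] / uvw[:, 2:3]                      # (k, 2)
```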
Human demonstrations are collected using RGB cameras and do not
require calibrated extrinsics, making it possible to leverage
large-scale human video datasets. We use
HaMeR, an off-the-shelf hand pose detector, to extract a set of 21
keypoints $s_t^{(i)} = \{(u_j^{(i)}, v_j^{(i)})\}_{j=1}^{21}$.
To roughly match the structure of the robot gripper, we select a
subset of $k = 5$ keypoints: one on the wrist and two each on
the thumb and index finger.
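A minimal sketch of this keypoint subsampling is shown below. The joint indices assume an OpenPose-style 21-keypoint ordering (0 = wrist, 1-4 = thumb, 5-8 = index finger); the exact indices depend on the detector’s output convention and are an assumption here.

```python
# Indices assume an OpenPose-style 21-keypoint hand layout; adjust to the
# detector's actual joint ordering if it differs.
HAND_SUBSET = [
    0,     # wrist
    3, 4,  # two thumb keypoints (distal joint and tip)
    7, 8,  # two index-finger keypoints (distal joint and tip)
]

def select_hand_keypoints(hand_keypoints_2d):
    """Reduce 21 detected hand keypoints to the k = 5 used for training.

    hand_keypoints_2d: (21, 2) array of (u, v) pixel coordinates.
    Returns a (5, 2) array ordered to loosely match the robot gripper
    keypoints (center, then two points per finger).
    """
    return hand_keypoints_2d[HAND_SUBSET]
```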
To infer per-timestep grasp actions from human videos, we use a
heuristic based on the proximity of hand keypoints to the
object(s) being manipulated. For each task, we first obtain a
pixel-wise mask of the object using
GroundingDINO and SAM 2.
Then, if the pixel distance from the object mask to the keypoints
on the thumb and to at least one of the other fingertip keypoints
falls below a threshold, we set $g_t^{(i)} = 1$. By loosely
matching the positioning and ordering of keypoints between the
human hand and robot gripper, we create an explicit
correspondence between human and robot action representations in
the image plane.
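The grasp heuristic can be sketched as follows. The fingertip indices again assume an OpenPose-style 21-keypoint layout, and the pixel threshold is an illustrative placeholder rather than the value used in our experiments.

```python
import numpy as np

# Fingertip indices assuming an OpenPose-style 21-keypoint hand layout
# (these indices and the pixel threshold are illustrative assumptions).
THUMB_TIP = 4
OTHER_FINGERTIPS = [8, 12, 16, 20]  # index, middle, ring, pinky tips

def infer_grasp(hand_keypoints_2d, object_mask, threshold_px=10.0):
    """Heuristic grasp label: returns 1 if the thumb tip and at least one
    other fingertip both lie within threshold_px pixels of the object mask.

    hand_keypoints_2d: (21, 2) array of (u, v) pixel coordinates.
    object_mask:       (H, W) boolean mask of the manipulated object.
    """
    ys, xs = np.nonzero(object_mask)
    if len(xs) == 0:
        return 0
    mask_pts = np.stack([xs, ys], axis=1).astype(np.float32)  # (M, 2) as (u, v)

    def dist_to_mask(kp):
        # Minimum Euclidean distance from a keypoint to any mask pixel.
        return float(np.min(np.linalg.norm(mask_pts - kp[None, :], axis=1)))

    thumb_close = dist_to_mask(hand_keypoints_2d[THUMB_TIP]) < threshold_px
    any_finger_close = any(
        dist_to_mask(hand_keypoints_2d[i]) < threshold_px
        for i in OTHER_FINGERTIPS
    )
    return int(thumb_close and any_finger_close)
```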