X-Sim learns image-conditioned policies from human videos without any robot teleoperation data
Real Robot Rollouts
We evaluate X-Sim over 5 different manipulation tasks, demonstrating its ability to learn diverse behaviors from human demonstration videos.
Abstract
Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms.
Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly.
We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies.
X-Sim starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards.
These rewards are used to train a reinforcement learning (RL) policy in simulation.
The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting.
To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment.
Importantly, X-Sim does not require any robot teleoperation data.
We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection time, and (3) generalizes to new camera viewpoints and test-time changes.
Approach Overview
We propose X-Sim, a real-to-sim-to-real framework that bridges the embodiment gap between humans and robots. Our approach uses object motion as a dense and transferable signal for learning robot policies, eliminating the need for robot teleoperation data.
1. Real-to-Sim
X-Sim first reconstructs a photorealistic simulation environment from a phone video scan of the scene. Given an RGB-D human demonstration video, X-Sim tracks object trajectories to define object-centric rewards. These rewards are used to train RL policies in simulation.
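As a rough illustration, an object-centric reward can compare the simulated object's pose at each step against the tracked pose from the human video. The Python sketch below is illustrative only (function names, weights, and the exact error terms are assumptions, not X-Sim's actual implementation):

```python
# A minimal sketch of an object-centric reward, assuming access to the current
# simulated object pose and the reference pose tracked from the human video
# (both 4x4 homogeneous transforms). Names and weights are illustrative.
import numpy as np

def object_centric_reward(obj_pose, ref_pose, pos_weight=1.0, rot_weight=0.1):
    """Dense reward for matching the reference object pose from the human demo."""
    # Translation error between current and reference object positions.
    pos_err = np.linalg.norm(obj_pose[:3, 3] - ref_pose[:3, 3])
    # Rotation error as the geodesic angle between the two orientations.
    rel_rot = obj_pose[:3, :3].T @ ref_pose[:3, :3]
    cos_angle = np.clip((np.trace(rel_rot) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.arccos(cos_angle)
    # Higher reward when the object tracks the demonstrated trajectory.
    return -(pos_weight * pos_err + rot_weight * rot_err)
```

Because the reward depends only on object motion, it transfers across embodiments: the robot is free to reach the demonstrated object poses with its own strategy.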
2. Sim-to-Real
The trained RL policy generates synthetic data by executing rollouts with randomized viewpoints, lighting, and object states in simulation. This data is used to train an image-conditioned policy that can operate directly from real-world camera inputs.
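A sketch of this distillation data collection is shown below, assuming a hypothetical simulator interface (`env.randomize`, `env.render`, `env.step`) and a privileged state-based policy `rl_policy`; these names are assumptions for illustration, not X-Sim's API:

```python
# A minimal sketch of synthetic data generation via domain randomization.
# The collected (image, action) pairs supervise the image-conditioned policy.
def collect_synthetic_rollouts(env, rl_policy, num_rollouts=1000, horizon=200):
    dataset = []  # (image, action) pairs for distillation
    for _ in range(num_rollouts):
        # Randomize camera viewpoint, lighting, and initial object pose per episode.
        env.randomize(camera=True, lighting=True, object_pose=True)
        state = env.reset()
        for _ in range(horizon):
            action = rl_policy(state)          # privileged state-based RL policy
            image = env.render()               # photorealistic rendering of the scene
            dataset.append((image, action))    # supervision for the distilled policy
            state, reward, done, info = env.step(action)
            if done:
                break
    return dataset
```

Randomizing viewpoints and lighting during rendering is what lets the distilled image-conditioned policy later generalize to new camera placements at test time.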
3. Online Domain Adaptation
During deployment, X-Sim calibrates its visual encoder by replaying real-world robot actions in simulation, creating paired real and simulated views of the same trajectories.
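One simple way to use these paired views is to fine-tune the visual encoder so real frames embed near their simulated counterparts. The sketch below is a hedged illustration in PyTorch; the specific alignment loss and training loop are assumptions, not necessarily the objective used in the paper:

```python
# A minimal sketch of online encoder calibration, assuming paired real and
# simulated images of the same replayed robot trajectory. Names and the
# MSE alignment loss are illustrative assumptions.
import torch
import torch.nn.functional as F

def calibration_loss(encoder, real_images, sim_images):
    """Align features of real frames with those of their simulated twins."""
    real_feats = encoder(real_images)          # features from real camera frames
    sim_feats = encoder(sim_images).detach()   # frozen targets from simulated renders
    return F.mse_loss(real_feats, sim_feats)

def calibrate(encoder, paired_loader, lr=1e-4, steps=100):
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _, (real_batch, sim_batch) in zip(range(steps), paired_loader):
        loss = calibration_loss(encoder, real_batch, sim_batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Because the paired data comes from replaying the robot's own deployment actions in simulation, no additional human or teleoperation data is needed for this calibration step.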
Key Results
X-Sim crosses the embodiment gap
Humans and robots have fundamentally different physical embodiments. Directly mapping human hand motions to robot actions through inverse kinematics (IK) often fails because the robot cannot reach the same joint configurations. Instead of forcing this direct mapping, X-Sim uses RL in simulation to discover robot-specific strategies that achieve the same object motion as the human.
X-Sim enables faster data collection
Human demonstrations can be collected much faster than the robot teleoperation data needed for behavior cloning methods like diffusion policies. X-Sim further amplifies this advantage by generating diverse synthetic data in simulation: varied viewpoints, lighting, and object positions. This combination of rapid human data collection and rich synthetic augmentation enables X-Sim to match the performance of behavior cloning with 10x less data collection time.
Paper
BibTex
@article{dan2025xsim,
  title={X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real},
  author={Prithwish Dan and Kushal Kedia and Angela Chao and Edward Weiyi Duan and Maximus Adrian Pace and Wei-Chiu Ma and Sanjiban Choudhury},
  year={2025},
  eprint={2505.07096},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.07096},
}