Language instructions and demonstrations are two natural ways for users to teach robots personalized tasks. Large Language Models (LLMs) have recently shown impressive performance in translating language instructions into code for robotic tasks. However, translating demonstrations into task code remains challenging: both demonstrations and code can be long and complex, which makes learning a direct mapping intractable.
This paper presents Demo2Code, a novel framework that generates robot task code from demonstrations via an extended chain-of-thought that defines a common latent specification to connect the two. Our framework employs a robust two-stage process: (1) a recursive summarization technique that condenses demonstrations into concise specifications, and (2) a code synthesis approach that recursively expands each function from the generated specification. We conduct an extensive evaluation on various robot task benchmarks, including Robotouille, a novel game benchmark designed to simulate diverse cooking tasks in a kitchen environment.
Demo2Code generates robot task code from language instructions and demonstrations through a two-stage process.
In stage 1, the LLM first summarizes each demonstration individually. Once all demonstrations are sufficiently summarized, they are then jointly summarized in the final step as the task specification.
In the example, the LLM is asked to perform some intermediate reasoning (e.g. identifying the order of the high-level actions) before outputting the specification (starting at "Make a burger...").
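Below is a minimal sketch of how Stage 1 could be implemented as a loop of LLM calls. The helper query_llm, the prompt wording, and the length-based check for "sufficiently summarized" are illustrative assumptions, not the exact prompts or code used by Demo2Code.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM completion endpoint."""
    raise NotImplementedError


def summarize_demo(demo: str, max_rounds: int = 3, max_lines: int = 10) -> str:
    """Condense one demonstration over several rounds until it is short enough."""
    summary = demo
    for _ in range(max_rounds):
        if len(summary.splitlines()) <= max_lines:  # crude "sufficiently summarized" check
            break
        summary = query_llm(f"Summarize this demonstration at a higher level:\n{summary}")
    return summary


def demos_to_spec(demos: list[str]) -> str:
    """Summarize each demo individually, then jointly summarize them into one task specification."""
    summaries = [summarize_demo(d) for d in demos]
    joined = "\n\n".join(summaries)
    return query_llm(
        "Reason step by step about what these summarized demonstrations have in common, "
        f"then output the task specification:\n{joined}"
    )
```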
In stage 2, given a task specification, the LLM first generates high-level task code that can call undefined functions. It then recursively expands each undefined function until eventually terminating with only calls to the existing APIs imported from the robot's low-level action and perception libraries.
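The sketch below shows one way the Stage 2 expansion loop could be organized. The names query_llm and ROBOT_APIS, and the regex-based scan for undefined functions, are assumptions made for illustration; the actual Demo2Code implementation and prompts may differ.

```python
import re

# Assumed set of existing low-level robot APIs (illustrative, not the real library).
ROBOT_APIS = {"move", "pick_up", "place", "turn_on_stove"}


def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM completion endpoint."""
    raise NotImplementedError


def undefined_calls(code: str, defined: set[str]) -> set[str]:
    """Crude regex-based scan for called functions that are neither defined yet
    nor part of the existing robot APIs (for illustration only)."""
    called = set(re.findall(r"(\w+)\s*\(", code))
    return called - defined - ROBOT_APIS


def expand_task_code(spec: str) -> str:
    """Generate high-level task code, then recursively expand every undefined
    function until only calls to existing low-level APIs remain."""
    code = query_llm(f"Write high-level task code for this specification:\n{spec}")
    defined = set(re.findall(r"def\s+(\w+)\s*\(", code))
    frontier = list(undefined_calls(code, defined))
    while frontier:
        fn = frontier.pop()
        if fn in defined:  # may have been defined as part of an earlier expansion
            continue
        body = query_llm(f"Define {fn}, calling only existing APIs or new helper functions:\n{code}")
        code += "\n\n" + body
        defined |= set(re.findall(r"def\s+(\w+)\s*\(", body))
        frontier += [f for f in undefined_calls(body, defined) if f not in frontier]
    return code
```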
In the example, cook_obj_at_loc is an initially undefined function that the LLM calls when it first generates the high-level task code. In contrast, move_then_pick only uses existing available APIs, so it requires no further expansion.
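As a concrete but hypothetical illustration of this example, the generated code might look like the snippet below; the low-level API names get_obj_location, get_curr_location, move, and pick_up are stand-ins for the robot's actual perception and action libraries.

```python
# High-level task code: cook_obj_at_loc is not yet defined here, so the LLM
# is prompted to expand it in a later recursive step.
def make_burger():
    cook_obj_at_loc("patty1", "stove1")
    # ... remaining high-level steps ...


# A fully expanded helper: move_then_pick calls only existing low-level APIs,
# so the recursion terminates at this function.
def move_then_pick(obj):
    obj_loc = get_obj_location(obj)
    move(get_curr_location(), obj_loc)
    pick_up(obj)
```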
We can successfully complete various tasks like cooking, tabletop manipulation, and dish washing while accommodating a user's preferences.
Demo2Code is compared against two other methods. Demo2Code can generate accurate policies from language instructions with preferences left implicit in the demonstrations.
In the first example here, Demo2Code successfully extracts specificity in tabletop tasks. Although the language instruction only ambiguously says "next to", Demo2Code correctly infers from the demonstrations that the goal is "left of". The baseline methods fail to infer this.
User 22 prefers to first scrub all objects one at a time, then rinse them one by one.
In contrast, User 30 prefers to scrub and rinse each object one by one.
Demo2Code can capture and translate these implicit preferences to code.
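As a hedged sketch, these two preferences could surface in the generated dish-washing code as two different loop structures; the scrub and rinse calls below are illustrative stand-ins for the robot's actual APIs.

```python
# User 22: scrub every object first, then rinse them one by one.
def wash_objects_user_22(objects):
    for obj in objects:
        scrub(obj)
    for obj in objects:
        rinse(obj)


# User 30: scrub and rinse each object before moving on to the next.
def wash_objects_user_30(objects):
    for obj in objects:
        scrub(obj)
        rinse(obj)
```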
Huaxiaoyue Wang, Gonzalo Gonzalez-Pumariega, Yash Sharma, Sanjiban Choudhury
We sincerely thank Nicole Thean (@nicolethean) for creating our art assets for Robotouille!