L2D2: Robot Learning from 2D Drawings

Virginia Tech
Under Review

Abstract

Robots should learn new tasks from humans. But how do humans convey what they want the robot to do? Existing methods largely rely on physical demonstrations, where the human teleoperates or kinesthetically guides the robot arm throughout their intended task. Unfortunately --- as we scale up the amount of human data --- physical guidance becomes prohibitively burdensome. Not only do teachers need to limit their motions to hardware constraints, but humans must also modify the environment (e.g., moving and resetting objects) to provide multiple task examples. In this work we therefore propose L2D2, a sketching interface and imitation learning algorithm where human teachers can provide demonstrations by drawing the task. L2D2 starts with a single third-person image of the robot arm and its workspace. Using a tablet, users draw and label trajectories on this image to illustrate how the robot should act. To collect new and diverse demonstrations, we no longer need the human to physically reset the workspace; instead, L2D2 leverages vision and language segmentation to autonomously vary object locations and generate synthetic images for the human to draw upon. We recognize that drawing trajectories is not as information-rich as physically demonstrating the task. Drawings are 2-dimensional (while the intended task occurs in our 3-dimensional world), and static drawings do not capture how the robot's actions affect its environment (e.g., pushing an object). To address these fundamental challenges the next stage of L2D2 grounds the human's static, 2D drawings in our dynamic, 3D world by leveraging a small set of physical demonstrations. Our experiments and user study suggest that L2D2 enables humans to provide more demonstrations with less time and effort than traditional approaches, and users prefer drawings over physical manipulation. When compared to other drawing-based approaches, we find that L2D2 learns more performant robot policies, requires a smaller dataset, and can generalize to longer-horizon tasks.

Video

L2D2: Overview




We propose L2D2, an approach that enables novice end-users to teach the robot diverse tasks by providing drawings while minimizing their physical interaction with the environment. We start with an image of the environment, on which users use language prompts to highlight the objects they want the robot to interact with. Our approach leverages VLMs to identify these objects and automatically generates synthetic images covering diverse task configurations. Users can draw on these diverse images to teach the robot without physically interacting with it. If the robot makes an error when executing the task learned from drawings, the user can physically correct its behavior. These real-world corrections are treated as new demonstration data for the task. We use these physical demonstrations to improve the trajectories reconstructed from drawings, and then fine-tune the robot policy with the updated reconstructions and the few real-world demonstrations. The diverse drawings teach the robot to generalize across different task settings, while the few physical demonstrations ground the drawings in the real world so that the robot performs the task accurately.
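To make the synthetic-image step more concrete, below is a minimal sketch of one way object relocation could be implemented. It assumes the object mask has already been produced by a vision-language segmentation model prompted with the user's description; the `relocate_object` helper and the naive background fill are illustrative assumptions, not the actual L2D2 implementation.

```python
# Sketch: generate synthetic workspace images by moving a segmented object.
# Assumes `mask` comes from a VLM-based segmenter prompted with the user's
# language description of the object (helper names are hypothetical).
import numpy as np

def relocate_object(image, mask, shift):
    """Cut the masked object out of `image`, fill the hole with the median
    background color, and paste the object back at pixel offset `shift` (dy, dx)."""
    image = image.copy()
    mask = mask.astype(bool)

    # Naive background fill: replace the object's pixels with the median
    # color of the unmasked (background) region.
    background_color = np.median(image[~mask], axis=0).astype(image.dtype)
    filled = image.copy()
    filled[mask] = background_color

    # Shift the object's pixels to the new location (clipped to the image bounds).
    ys, xs = np.nonzero(mask)
    new_ys = np.clip(ys + shift[0], 0, image.shape[0] - 1)
    new_xs = np.clip(xs + shift[1], 0, image.shape[1] - 1)
    filled[new_ys, new_xs] = image[ys, xs]
    return filled

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 255, size=(480, 640, 3), dtype=np.uint8)  # placeholder photo
    mask = np.zeros((480, 640), dtype=bool)
    mask[200:260, 300:360] = True                                     # placeholder object mask
    # A batch of synthetic images with the object at random offsets,
    # ready for the user to draw on.
    synthetic = [relocate_object(image, mask, rng.integers(-100, 100, size=2))
                 for _ in range(10)]
```

In practice a learned inpainting model would fill the background more convincingly than a median color, but the overall idea of segment, fill, and re-place is the same.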



Interface

Our interface consists of three parts. The first part shows an image of the environment with the robot and the objects for the task. Users start drawing by selecting the "START" button; the drawing they provide shows the general trajectory they want the robot to follow. In the second part, users select the points along the drawing where they want to toggle the robot's gripper by pressing the respective buttons. Finally, to let users specify the gripper's orientation, we provide a 3D visualization of the gripper that rotates in real time with user inputs. Using this visualization and the three sliders on the interface, users can select how the robot should change its orientation at different stages of the task.
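As an illustration of what the interface produces, the sketch below shows one plausible way to store a drawn demonstration: the 2D stroke from the tablet, the stroke indices where the gripper is toggled, and the orientation keyframes set with the sliders. The field names and the arc-length resampling helper are assumptions for exposition, not L2D2's actual data format.

```python
# Sketch of a drawn-demonstration record and a resampling utility (hypothetical format).
from dataclasses import dataclass, field
import numpy as np

@dataclass
class DrawnDemo:
    stroke: np.ndarray                                     # (N, 2) pixel coordinates of the drawing
    gripper_toggles: list = field(default_factory=list)    # stroke indices where the gripper opens/closes
    orientation_keys: dict = field(default_factory=dict)   # stroke index -> (roll, pitch, yaw) from the sliders

    def resample(self, num_points=50):
        """Resample the 2D stroke to a fixed number of waypoints by arc length,
        so drawings of different lengths can be batched for training."""
        deltas = np.diff(self.stroke, axis=0)
        arc = np.concatenate([[0.0], np.cumsum(np.linalg.norm(deltas, axis=1))])
        targets = np.linspace(0.0, arc[-1], num_points)
        xs = np.interp(targets, arc, self.stroke[:, 0])
        ys = np.interp(targets, arc, self.stroke[:, 1])
        return np.stack([xs, ys], axis=1)
```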


BibTeX