VIEW: Visual Imitation Learning with Waypoints

Ananth Jonnavittula      Sagar Parekh      Dylan Losey
Collab, Virginia Tech
Under Review

Robots learning from a single human video demonstration.


Robots can use Visual Imitation Learning (VIL) to learn everyday tasks from video demonstrations. However, translating visual observations into actionable robot policies is challenging due to the high-dimensional nature of video data. This challenge is further exacerbated by the morphological differences between humans and robots, especially when the video demonstrations feature humans performing tasks. To address these problems, we introduce Visual Imitation lEarning with Waypoints (VIEW), an algorithm that significantly enhances the sample efficiency of human-to-robot VIL. VIEW achieves this efficiency using a multi-pronged approach: extracting a condensed prior trajectory that captures the demonstrator's intent, employing an agent-agnostic reward function for feedback on the robot's actions, and utilizing an exploration algorithm that efficiently samples around waypoints in the extracted trajectory. VIEW also segments the human trajectory into grasp and task phases to further improve learning efficiency. Through comprehensive simulations and real-world experiments, VIEW demonstrates improved performance compared to current state-of-the-art VIL methods. VIEW enables robots to learn a diverse range of manipulation tasks involving multiple objects from arbitrarily long video demonstrations. Additionally, it can learn standard manipulation tasks such as pushing or moving objects from a single video demonstration in under 30 minutes, with fewer than 20 real-world rollouts.

How Does it Work?

VIEW, our proposed method for human-to-robot visual imitation learning, begins with a single video demonstration of a task. From this video, we extract the object of interest, its trajectory, and the human's trajectory. Compressing the human's trajectory yields a trajectory prior: a sequence of waypoints for the robot arm. This initial trajectory is often imprecise due to human-robot differences and extraction noise. We refine the prior using a residual network trained on previous tasks to de-noise the data. The de-noised trajectory is then segmented into two phases, grasp exploration and task exploration. During grasp exploration, the robot modifies the pick point to determine how to pick up the object. After a successful grasp, the robot corrects the remaining waypoints during task exploration. The robot then synthesizes a complete trajectory, which, along with the prior trajectory, is used to further train the residual network, enhancing future performance.
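To make the two exploration phases concrete, here is a minimal Python sketch of splitting a waypoint prior at the pick point and proposing candidate pick points near it. It is an illustration only: the function names (`split_phases`, `sample_around`), the `grasp_index` argument, the uniform box sampling, and the example coordinates are all assumptions and are not taken from VIEW's implementation, whose exploration strategy may differ.

```python
import random


def split_phases(waypoints, grasp_index):
    """Split a waypoint list into a grasp phase and a task phase.

    The grasp phase ends at the pick point (inclusive); the task phase
    holds every waypoint after it. `grasp_index` is assumed to be known.
    """
    if not 0 <= grasp_index < len(waypoints):
        raise ValueError("grasp_index out of range")
    return waypoints[: grasp_index + 1], waypoints[grasp_index + 1 :]


def sample_around(waypoint, radius, n_samples, rng):
    """Draw candidates uniformly from a box of half-width `radius`.

    Uniform box sampling is a placeholder assumption, not VIEW's sampler.
    """
    return [
        tuple(c + rng.uniform(-radius, radius) for c in waypoint)
        for _ in range(n_samples)
    ]


# Hypothetical (x, y, z) waypoint prior in metres; index 1 is the pick point.
prior = [
    (0.40, 0.00, 0.30),
    (0.45, 0.10, 0.05),
    (0.60, 0.10, 0.20),
    (0.60, -0.20, 0.05),
]

grasp_phase, task_phase = split_phases(prior, grasp_index=1)
pick_point = grasp_phase[-1]

# Candidate pick points to try during grasp exploration.
candidates = sample_around(pick_point, radius=0.02, n_samples=5, rng=random.Random(0))
```

In a full system, each candidate would be executed on the robot and scored by the reward function, and the task-phase waypoints would only be corrected once a grasp succeeds.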

How Do We Extract Human Priors?

An overview of our prior extraction method. We start by identifying the hand's location and its contact with objects using the 100 Days of Hands (100DOH) detector. We refine the human's hand trajectory with the MANO model to capture wrist movements. To eliminate redundancy, we apply the SQUISHE algorithm, producing an initial trajectory with key waypoints for the robot. To identify the object of interest amid clutter, we analyze frames with hand-object contact, creating anchor boxes that, combined with an object detector, reveal the object the human interacts with most frequently. This allows us to construct an accurate object trajectory from the video.
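The redundancy-removal step can be pictured with a simplified greedy compression: repeatedly drop the interior point that deviates least from the straight line between its neighbours until the desired number of waypoints remains. The Python sketch below is not the SQUISHE algorithm itself, which maintains a priority queue and error bounds; the function names, the `n_waypoints` parameter, and the toy wrist trajectory are assumptions made for illustration.

```python
import math


def point_to_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    if denom == 0.0:
        # a and b coincide, so the segment is a single point.
        return math.dist(p, a)
    # Projection parameter of p onto a-b, clamped to stay on the segment.
    t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * x for ai, x in zip(a, ab)]
    return math.dist(p, closest)


def compress(points, n_waypoints):
    """Greedily remove the least informative interior points.

    The first and last points are always kept. This is a simplified
    stand-in for SQUISH-style compression, not the published algorithm.
    """
    if n_waypoints < 2:
        raise ValueError("need at least the two endpoints")
    kept = list(points)
    while len(kept) > n_waypoints:
        errors = [
            point_to_segment_distance(kept[i], kept[i - 1], kept[i + 1])
            for i in range(1, len(kept) - 1)
        ]
        # Interior indices start at 1, hence the offset.
        del kept[errors.index(min(errors)) + 1]
    return kept


# Toy wrist trajectory: move along x, then turn and move along y.
wrist = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (2.0, 0.0, 0.0),
    (2.0, 1.0, 0.0),
    (2.0, 2.0, 0.0),
]
waypoints = compress(wrist, n_waypoints=3)
```

On this toy input, the two collinear midpoints are removed and the corner at `(2.0, 0.0, 0.0)` survives, which matches the intent of keeping only waypoints where the motion changes.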


      title={VIEW: Visual Imitation Learning with Waypoints}, 
      author={Ananth Jonnavittula and Sagar Parekh and Dylan P. Losey},
      journal={arXiv preprint arXiv:2404.17906},


We thank Heramb Nemlekar for his feedback on our manuscript. This work was supported by the USDA National Institute of Food and Agriculture, Grant 2022-67021-37868.