Stable-BC: Controlling Covariate Shift with Stable Behavior Cloning

1Virginia Tech, 2University of Southern California
Under Review

A robot learning to play air hockey with 15 seconds of training data

Abstract

Behavior cloning is a common imitation learning paradigm. Under behavior cloning, the robot collects expert demonstrations and then trains a policy to match the actions taken by the expert. This works well when the robot learner visits states where the expert has already demonstrated the correct action; but inevitably the robot will also encounter new states outside of its training dataset. If the robot learner takes the wrong action at these new states, it can move farther from the training data, which in turn leads to increasingly incorrect actions and compounding errors. Existing works try to address this fundamental challenge by augmenting or enhancing the training data. By contrast, in our paper we develop the control-theoretic properties of behavior cloned policies. Specifically, we consider the error dynamics between the system's current state and the states in the expert dataset. From the error dynamics we derive model-based and model-free conditions for stability: under these conditions the robot shapes its policy so that its current behavior converges towards example behaviors in the expert dataset. In practice, this results in Stable-BC, an easy-to-implement extension of standard behavior cloning that is provably robust to covariate shift. We demonstrate the effectiveness of our algorithm in simulations with interactive, nonlinear, and visual environments. We also conduct experiments where a robot arm uses Stable-BC to play air hockey.

Video

How Does it Work?


Stable-BC leverages offline demonstrations provided by experts to learn a behavior cloned policy that is robust to covariate shift. We model the error between the current robot and environment states and the states visited in the expert demonstrations as a linearized dynamical system. When this dynamical system is stable, the robot's learned behavior locally converges towards the behavior of the human demonstrations, thus mitigating covariate shift. However, stabilizing this dynamical system alone does not guarantee that the robot will solve the task successfully. We therefore train the robot to match the expert's demonstrations using the standard behavior cloning loss function, and then add a second loss that stabilizes the dynamical system. This leads to a learned policy that performs the task successfully while also converging towards the expert behaviors.
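To make this concrete, below is a minimal sketch of how the two loss terms might be combined in PyTorch. The double-integrator dynamics, network sizes, placeholder data, and the weight lam are illustrative assumptions rather than the paper's exact implementation; the key idea is to penalize eigenvalues of the linearized closed-loop dynamics whose real parts are positive.

    import torch
    import torch.nn as nn

    # Minimal sketch of the Stable-BC objective. Assumptions (illustrative
    # only): a 4D robot state [position, velocity], 2D acceleration actions,
    # and known double-integrator dynamics.
    policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

    def dynamics(x, u):
        # Robot dynamics x_dot = f(x, u); here a double integrator.
        return torch.cat([x[2:], u])

    def closed_loop(x):
        # Autonomous dynamics under the learned policy: x_dot = f(x, pi(x)).
        return dynamics(x, policy(x))

    def stable_bc_loss(states, actions, lam=0.1):
        # Loss 1 -- standard behavior cloning: match the expert's actions.
        bc_loss = ((policy(states) - actions) ** 2).mean()

        # Loss 2 -- stability: penalize eigenvalues of the linearized
        # closed-loop dynamics with positive real parts, so rollouts
        # locally converge back towards the demonstrated states.
        stability_loss = torch.zeros(())
        for x in states:
            A = torch.autograd.functional.jacobian(
                closed_loop, x, create_graph=True)
            stability_loss = stability_loss + torch.relu(
                torch.linalg.eigvals(A).real).sum()

        return bc_loss + lam * stability_loss / len(states)

    # Usage: descend the combined loss on expert (state, action) pairs.
    states, actions = torch.randn(32, 4), torch.randn(32, 2)  # placeholders
    stable_bc_loss(states, actions).backward()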

Interactive Driving Simulation

This simulation involves two vehicles trying to cross an intersection. One car is controlled by a simulated human, while the other is an autonomous agent. The goal of the autonomous car is to reach its goal position while avoiding a collision with the simulated human driver. Below we show some behaviors executed by the autonomous agent under policies learned with BC and Stable-BC.

Quadrotor Simulation with Nonlinear Dynamics

We then evaluate our approach in higher-dimensional settings with nonlinear system dynamics. A quadrotor must navigate through a room to reach a goal location. Below we show some behaviors executed by the quadrotor as it navigates around randomly placed obstacles in the room under policies learned with BC and Stable-BC.

Point Mass Simulation with Visual Observations

In this simulation, we evaluate our approach in visual learning settings. The robot receives an image containing the goal location as its observation of the environment, and must infer the actions that will lead it to the goal. Below we show example input images and the behaviors executed by the robot under policies learned with BC and Stable-BC.

Air Hockey Experiments

In this experiment, a 7-DoF Franka Emika robot arm learns to play a simplified game of air hockey from human demonstrations. Participants use a 2-DoF joystick to control the position of the robot on the air hockey table. Each participant is given 2 minutes of practice before providing demonstrations. The demonstrations from each participant include data collected over ~2.5 minutes of play. The data is recorded at 20 Hz, yielding ~3000 datapoints in ~2.5 minutes of play. Snippets of participants playing air hockey to provide demonstrations to the robot are shown below.



The robot has access to its own state and the state of the puck on the air hockey table, but it does not have access to the puck's dynamics, i.e., how the motion of the puck will change given the robot's actions. We train the BC and Stable-BC policies using subsets of the data from the collected demonstrations. Across 15, 60, and 120 seconds of training data, Stable-BC hits the puck more often and achieves better performance than BC; a sketch of how the stability loss might be adapted to this unknown-dynamics setting follows the results below.

60 seconds of training data

120 seconds of training data
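Since the robot cannot model the puck, the stability condition used here cannot rely on the full system dynamics. Below is one plausible way to adapt the stability loss for this setting, reusing the definitions from the earlier sketch. The names (policy_xy, model_free_stability_loss) and the choice to penalize the puck-to-robot coupling Jacobian are our assumptions based on the high-level description above, not the authors' exact formulation.

    # One plausible adaptation of the stability loss when the environment
    # (puck) dynamics are unknown. Reuses torch, nn, and dynamics() from
    # the sketch above; policy_xy is a hypothetical network conditioned on
    # both the robot state x and the puck state y.
    policy_xy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))

    def robot_closed_loop(x, y):
        # Robot-side dynamics x_dot = f(x, pi(x, y)); the puck's unknown
        # dynamics never appear in this expression.
        return dynamics(x, policy_xy(torch.cat([x, y])))

    def model_free_stability_loss(x, y):
        # Stabilize the robot-state block of the linearized system...
        A = torch.autograd.functional.jacobian(
            lambda x_: robot_closed_loop(x_, y), x, create_graph=True)
        # ...and damp the coupling from the unmodeled puck state into the
        # robot's motion, so puck-prediction errors cannot compound.
        B = torch.autograd.functional.jacobian(
            lambda y_: robot_closed_loop(x, y_), y, create_graph=True)
        return torch.relu(torch.linalg.eigvals(A).real).sum() + B.norm()

As before, this term would be added to the standard behavior cloning loss with a small weight.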

Comparison to Offline-RL


In addition to the experiments provided in the paper, we also compare our approach to an Offline-RL algorithm (CQL) in the intersection environment. Our results suggest that Stable-BC converges to demonstrator-level performance with fewer demonstrations than CQL. However, when both approaches have access to a large number of demonstrations, Offline-RL may outperform Stable-BC. This trend arises because Offline-RL has access to the reward function in addition to the expert demonstrations: while Stable-BC only leverages the noisy, imperfect data provided by the demonstrator, CQL can overcome these imperfect demonstrations because it has access to the reward function during training. This suggests that when a large number of demonstrations and a reward function for the task are available, Offline-RL can learn a policy that outperforms the demonstrator. However, when the robot only has a few demonstrations, or the reward function for the task is not readily available, Stable-BC can be leveraged to learn a robust policy that matches the demonstrator's performance.

BibTeX

@article{mehta2024stable,
  title={Stable-BC: Controlling Covariate Shift with Stable Behavior Cloning},
  author={Mehta, Shaunak A and Ciftci, Yusuf Umut and Ramachandran, Balamurugan and Bansal, Somil and Losey, Dylan P},
  journal={arXiv preprint arXiv:2408.06246},
  year={2024}
}