Waypoint-Based Reinforcement Learning for Robot Manipulation Tasks

Collab, Virginia Tech
Accepted at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024

Robot opening a drawer with a policy trained using our approach

Abstract

Robot arms should be able to learn new tasks. One framework here is reinforcement learning, where the robot is given a reward function that encodes the task, and the robot autonomously learns actions to maximize its reward. Existing approaches to reinforcement learning often frame this problem as a Markov decision process, and learn a policy (or a hierarchy of policies) to complete the task. These policies reason over hundreds of fine-grained actions that the robot arm needs to take: e.g., moving slightly to the right or rotating the end-effector a few degrees. But the manipulation tasks that we want robots to perform can often be broken down into a small number of high-level motions: e.g., reaching an object or turning a handle. In this paper we therefore propose a waypoint-based approach for model-free reinforcement learning. Instead of learning a low-level policy, the robot now learns a trajectory of waypoints, and then interpolates between those waypoints using existing controllers. Our key novelty is framing this waypoint-based setting as a sequence of multi-armed bandits: each bandit problem corresponds to one waypoint along the robot's motion. We theoretically show that an ideal solution to this reformulation has lower regret bounds than standard frameworks. We also introduce an approximate posterior sampling solution that builds the robot's motion one waypoint at a time. Results across benchmark simulations and two real-world experiments suggest that this proposed approach learns new tasks more quickly than state-of-the-art baselines.
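To make the waypoint idea concrete, the minimal Python sketch below (our own illustration, not code from the paper) expands a short sequence of waypoints into a dense reference trajectory that an existing low-level controller could then track. The function name, step count, and example waypoints are hypothetical.

import numpy as np

def interpolate_waypoints(waypoints, steps_per_segment=50):
    # Expand a short sequence of waypoints into a dense reference trajectory
    # by linearly interpolating between consecutive waypoints.
    waypoints = np.asarray(waypoints, dtype=float)
    dense = []
    for start, end in zip(waypoints[:-1], waypoints[1:]):
        for alpha in np.linspace(0.0, 1.0, steps_per_segment, endpoint=False):
            dense.append((1.0 - alpha) * start + alpha * end)
    dense.append(waypoints[-1])
    return np.asarray(dense)

# Example: three Cartesian waypoints for a reach-and-lift motion (hypothetical values).
reference = interpolate_waypoints([[0.4, 0.0, 0.30],
                                   [0.5, 0.1, 0.10],
                                   [0.5, 0.1, 0.40]])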

Video

How Does it Work?


We frame model-free reinforcement learning as a sequence of continuous multi-armed bandit problems, where each bandit problem corresponds to one waypoint along the robot's learned trajectory. The regret bounds derived in the manuscript for this multi-armed bandit setting show that waypoint-based reinforcement learning scales linearly: the time needed to train for a given task increases linearly with the number of waypoints needed to solve the task.
The robot arm learns to place the next waypoint to maximize its reward by solving a multi-armed bandit problem. Once the robot has learned to correctly place waypoint i, we freeze the learned models for that waypoint. To learn the next waypoint, the robot rolls out the trajectory for waypoints 1 to i using the saved models and then repeats the learning process for waypoint i+1.
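The sketch below summarizes this one-waypoint-at-a-time loop. It is only an illustration under simplifying assumptions: the approximate posterior sampling used in the paper is replaced by a naive random-search bandit, and rollout_reward is a hypothetical function that executes the interpolated trajectory through the given waypoints and returns the task reward.

import numpy as np

def learn_waypoints(rollout_reward, n_waypoints, n_episodes=200,
                    waypoint_dim=3, noise=0.05):
    # Learn one waypoint at a time: treat each waypoint as a continuous bandit,
    # search for a placement that maximizes the task reward, then freeze it
    # before moving on to the next waypoint.
    frozen = []                                    # waypoints already learned (frozen)
    for i in range(n_waypoints):
        best_wp, best_reward = np.zeros(waypoint_dim), -np.inf
        for _ in range(n_episodes):
            candidate = best_wp + noise * np.random.randn(waypoint_dim)
            reward = rollout_reward(frozen + [candidate])   # roll out waypoints 1..i+1
            if reward > best_reward:
                best_wp, best_reward = candidate, reward
        frozen.append(best_wp)                     # freeze waypoint i+1
    return frozen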

Benchmark Simulations

We evaluate the performance of our proposed approach in Robosuite, a simulated robot environment with a set of standard manipulation tasks for robot arms. Across these benchmark manipulation tasks, we compare our proposed approach to the standard reinforcement learning algorithms SAC and PPO, as well as to extended versions of these algorithms, SAC-wp and PPO-wp. Below we show a video comparing the performance of Ours to that of SAC (the best performing baseline). The detailed results and comparisons with the other baselines are discussed in the manuscript.

Real-World Experiments

To evaluate whether our proposed algorithm can be applied to real-world settings, i.e., training from scratch in the real world, we train a 7-DoF Franka Emika robot arm to perform two tasks. The first task is similar to the simulated task, where the robot needs to pick up a block placed on a table. In the second task, the robot needs to open a drawer placed in its workspace. In this real-world setting, we compare the performance of our proposed approach to that of SAC, the best performing baseline from simulation. Videos of the robot's performance on both tasks are shown below.

Picking-up a Block

Opening a Drawer

BibTeX

@article{mehta2024waypoint,
  title={Waypoint-Based Reinforcement Learning for Robot Manipulation Tasks},
  author={Mehta, Shaunak A and Habibian, Soheil and Losey, Dylan P},
  journal={arXiv preprint arXiv:2403.13281},
  year={2024}
}