Visual Model-Based Reinforcement Learning as a Path towards Generalist Robots

With very little explicit supervision and feedback, humans are able to learn a
wide range of motor skills by simply interacting with and observing the world
through their senses. While there has been significant progress towards building
machines that can learn complex skills based on raw sensory information
such as image pixels, acquiring large and
diverse repertoires of general skills remains an open challenge. Our goal is
to build a generalist: a robot that can perform many different tasks, like
arranging objects, picking up toys, and folding towels, and can do so with many
different objects in the real world without re-learning for each object or task.
While these basic motor skills are much simpler and less impressive than mastering Chess or even using a spatula, we think that
being able to achieve such generality with a single model is a fundamental
aspect of intelligence.

The key to acquiring generality is diversity. If you deploy a learning
algorithm in a narrow, closed-world environment, the agent will recover skills
that are successful only in a narrow range of settings. That’s why an algorithm
trained to play Breakout will struggle when anything about the images or the
game changes. Indeed, the success of image classifiers relies on large, diverse
datasets like ImageNet. However, having a robot autonomously learn from large
and diverse datasets is quite challenging. While collecting diverse sensory data
is relatively straightforward, it is simply not practical for a person to
annotate all of the robot’s experiences. It is more scalable to collect
completely unlabeled experiences. Then, given only sensory data, akin to what
humans have, what can you learn? With raw sensory data there is no notion of
progress, reward, or success. Unlike games like Breakout, the real world doesn’t
give us a score or extra lives.

We have developed an algorithm that can learn a general-purpose predictive model
using unlabeled sensory experiences, and then use this single model to perform a
wide range of tasks.

With a single model, our approach can perform a wide range of tasks, including
lifting objects, folding shorts, placing an apple onto a plate, rearranging
objects, and covering a fork with a towel.

In this post, we will describe how this works. We will discuss how we can learn
based on only raw sensory interaction data (i.e. image pixels, without requiring
object detectors or hand-engineered perception components). We will show how we
can use what was learned to accomplish many different user-specified tasks. And,
we will demonstrate how this approach can control a real robot from raw pixels,
performing tasks and interacting with objects that the robot has never seen before.

Learning to Predict from Unsupervised Interaction

We first need a means to collect diverse data. If we train the robot to
perform a single skill with a single object instance, e.g. using a particular
hammer to hit a particular nail, then it will only learn about that narrow
setting; that particular hammer and nail is its entire universe. How can we
build robots that learn more general skills? Instead of learning a single task
in a narrow environment, we can have robots learn on their own, in diverse
environments, akin to a child
playing and exploring.

If a robot can collect data on its own and learn from that experience completely
autonomously, then it doesn’t require a person to supervise and can hence
collect experience and learn about the world at any time of day, even overnight!
Further, multiple robots can collect data simultaneously and share their
experiences – data collection is scalable, hence making it practical to
collect diverse data with many objects and motions. To implement this, we had
two robots collect data in parallel by taking random actions with a wide range
of objects, both rigid objects like toys and cups, and deformable objects like
cloth and towels:
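This autonomous collection loop is simple to sketch. The following is a minimal illustration, not the actual system: `FakeRobot` and its methods are hypothetical stand-ins for a real robot interface, and the action bounds are made up.

```python
import numpy as np

class FakeRobot:
    """Hypothetical stand-in for a real robot API, used only for illustration."""
    def get_camera_image(self):
        return np.zeros((64, 64, 3), dtype=np.uint8)   # vision
    def get_arm_position(self):
        return np.zeros(3)                             # proprioception
    def apply_action(self, action):
        pass                                           # motor command

def collect_random_episode(robot, horizon=30):
    """Collect one episode of self-supervised experience by taking
    random actions and recording the raw sensory stream."""
    episode = []
    for _ in range(horizon):
        image = robot.get_camera_image()
        arm_pose = robot.get_arm_position()
        # Sample a random end-effector displacement (illustrative bounds).
        action = np.random.uniform(low=-0.05, high=0.05, size=3)
        robot.apply_action(action)
        episode.append({"image": image, "state": arm_pose, "action": action})
    return episode
```

Because no labels or rewards are needed, many such episodes can be logged around the clock and pooled across robots.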

Two robots interact with the world, collecting data
autonomously with many objects and many motions.

In the data collection process, we observe what the robot’s sensors measure: the
image pixels (vision), the position of the arm (proprioception), and the motor
commands sent to the robot (action). We cannot directly measure the positions of
the objects, how they react to being pushed, their speed, etc. Further, in this
data, there is no notion of progress or success. Unlike a game of Breakout or
hammering a nail, we don’t get a score or an objective. All we have to learn
from, when interacting in the real world, is what is provided by our senses,
or in this case, the robot’s sensors.

So, what can we learn, when only given our senses? We can learn to predict
— what will the world look like, or feel like, if the robot moves its arm in
one way versus in another way?

The robot learns to predict what the future will look like if it moves
its arm in different ways, learning about physics, objects, and itself.
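At test time, the learned predictor is rolled forward: given the current image and a candidate action sequence, it produces the sequence of predicted future frames. A minimal sketch of that interface is below; `toy_model` is a trivial stand-in (it just translates the image by the action), whereas the real model is a learned action-conditioned video prediction network.

```python
import numpy as np

def toy_model(image, action):
    """Toy stand-in for a learned predictive model: translate the image
    by an integer (dy, dx) action. Illustrative only."""
    dy, dx = action
    return np.roll(np.roll(image, dy, axis=0), dx, axis=1)

def predict_rollout(model, current_image, actions):
    """Roll the predictive model forward through a candidate action
    sequence, collecting the predicted frames."""
    frames = []
    image = current_image
    for action in actions:
        image = model(image, action)  # predicted next frame
        frames.append(image)
    return frames
```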

Prediction allows us to learn general things about the world, things like
objects and physics. And such general-purpose knowledge is exactly what the
Breakout-playing agent is missing. Prediction also allows us to learn from all
of the data that we have: a stream of actions and images has a lot of implicit
supervision. This is critical because we don’t have a score or reward function.
Model-free reinforcement learning systems typically only learn from the
supervision provided from the reward function, whereas model-based RL agents
utilize the rich information available in the pixels they observe. Now, how do
we actually use these predictions? We will discuss this next.

Planning to Perform Human-Specified Tasks

If we have a predictive model of the world, then we can use it to plan to
achieve goals. That is, if we understand the consequences of our actions, then
we can use that understanding to choose actions that lead to the desired
outcome. We use a sampling-based procedure to plan. In particular, we sample
many different candidate action sequences, then select the top plans—the
actions that are most likely to lead to the desired outcome—and refine our
plan iteratively, by resampling from a distribution of actions fitted to the top
candidate action sequences. Once we come up with a plan that we like, we then
execute the first step of our plan in the real world, observe the next image,
and then replan in case something unexpected happened.
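The sampling-and-refitting procedure above is a form of the cross-entropy method. Here is a minimal sketch, assuming a `cost_fn` that uses the predictive model to score a candidate action sequence (lower is better); the parameter values are illustrative, not those used in the actual system.

```python
import numpy as np

def cem_plan(cost_fn, horizon=10, action_dim=2, n_samples=200,
             n_elite=20, n_iters=3):
    """Sampling-based planning: sample action sequences, score them with
    the predictive model's cost, and refit a Gaussian to the best ones."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current distribution.
        samples = mean + std * np.random.randn(n_samples, horizon, action_dim)
        costs = np.array([cost_fn(seq) for seq in samples])
        # Keep the top candidates and refit the sampling distribution.
        elite = samples[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean  # the plan: execute only its first action, then replan
```

Executing only the first action and replanning at every step (model-predictive control) is what lets the robot recover when something unexpected happens.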

A natural question now is—how can a user specify a goal or desired outcome to
the robot? We have experimented with a number of different ways to do so. One of
the easiest mechanisms that we have found is to simply click on a pixel in the
initial image and specify where the object corresponding to that pixel should be
moved, by clicking another pixel position. We can also give more than one pair
of pixels to specify other desired object motions. While there are types of
goals that cannot be expressed in this way (and we have explored more versatile
goal specifications, such as goal classifiers), we have found that specifying
pixel positions can be used to describe a wide variety of tasks and is
remarkably easy to provide. To be clear, these user-provided goal specifications
are not used during the data collection, when the robot is interacting with the
world—they are only used at test time, when we want the robot to use
its predictive model to accomplish a certain goal.
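Pixel goals translate naturally into a planning cost. As one hedged sketch of the idea: if the model can produce a probability map over where the user-designated pixel ends up after a candidate action sequence, the cost can be the expected distance to the user's goal pixel. The function below is illustrative, not the exact cost used in the paper.

```python
import numpy as np

def pixel_goal_cost(predicted_probs, goal_pixel):
    """Expected distance between the designated pixel's predicted final
    location and the user-clicked goal pixel.

    predicted_probs: (H, W) map of probabilities, summing to 1, of where
    the designated pixel ends up under a candidate action sequence.
    goal_pixel: (row, col) location clicked by the user.
    """
    h, w = predicted_probs.shape
    rows, cols = np.mgrid[0:h, 0:w]
    dists = np.sqrt((rows - goal_pixel[0]) ** 2 + (cols - goal_pixel[1]) ** 2)
    return float(np.sum(predicted_probs * dists))
```

Summing this cost over several designated pixel/goal pairs lets a single plan serve goals that involve moving more than one object.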


We experiment with this overall approach on a Sawyer robot, collecting 2 weeks
of unsupervised experience. Critically, the only human involvement during
training is providing a diverse range of objects for the robot to interact with
(swapping out objects periodically) and coding the random robot motions that are
used to collect data. This allows us to collect data on multiple robots nearly
24 hours a day, with very little effort. We train a single action-conditioned
video prediction model on all of this data, including two camera viewpoints, and
use the iterative planning procedure described previously to plan and execute on
user-specified tasks.

Since we set out to achieve generality, we evaluate the same predictive
model on a wide range of tasks involving objects that the robot has never seen
before and goals the robot has not encountered previously.

For example, we ask the robot to fold shorts:

Left: The goal is to fold the left side of the shorts. Middle: the robot’s
prediction corresponding to its plan. Right: the robot performs its plan.

Or put an apple on a plate:

Left: The goal is to put the apple on the plate. Middle: the robot’s
prediction corresponding to its plan. Right: the robot performs its plan.

Finally, we can also ask the robot to cover a spoon with a towel:

Left: The goal is to cover the spoon with the towel. Middle: the robot’s
prediction corresponding to its plan. Right: the robot performs its plan.

Interestingly, we find that, even though the model’s predictions are far from
perfect, it can still use them to effectively accomplish the specified goal.

There have been many
prior works that approach the problem of
model-based reinforcement learning (RL), i.e. learning a predictive model, and
then using this model to act or using it to learn a policy. Many such prior
works have focused on settings where the positions of objects or other
task-relevant information can be accessed directly—rather than through images
or other raw sensor observations. Having this low-dimensional state
representation is a strong assumption that is often impossible to fulfill in the
real world. Model-based RL methods that directly operate on raw image frames
have not been studied as extensively. Several algorithms have been proposed for
simple, synthetic images and video game environments, which have focused on
a fixed set of objects and tasks. Other work has studied model-based RL in the
real world, again focusing on individual skills.

A number of recent works have studied self-supervised robotic learning, where
large-scale unattended data collection is used to learn individual skills such
as grasping (e.g. see these works), push-grasp synergies, or obstacle avoidance. Our approach is also
fully self-supervised; in contrast with these approaches, we learn a predictive
model that is goal-agnostic and can be used to perform a variety of manipulation tasks.


Generalization to many distinct tasks in visually diverse settings is arguably
one of the biggest challenges in reinforcement learning and robotics research
today. Deep learning has greatly reduced the amount of task-specific engineering
needed to deploy an algorithm; however, prior methods typically require
extensive amounts of supervised experience or focus on mastery of individual
tasks. Our results suggest that our approach can generalize to a wide range of
tasks and objects, including those never seen previously. The generality of the
model is the result of large-scale self-supervised learning from interaction. We
believe the results represent a significant step forward in terms of
generality of tasks achieved by a single robotic reinforcement learning agent.

The work in this post is based on the following paper:

The above paper is an extended version of the following four papers, and
builds upon the fifth paper.