Drilling Down on Depth Sensing and Deep Learning



Top left: image of a 3D cube. Top right: example depth image, with darker points
representing areas closer to the camera (source: Wikipedia). Next two
rows: examples of depth and RGB image pairs for grasping objects in a bin. Last
two rows: similar examples for bed-making.

This post explores two independent innovations and the potential for combining
them in robotics. Two years before the AlexNet results on ImageNet
were released in 2012, Microsoft rolled out the Kinect for the Xbox. This class
of low-cost depth sensors emerged just as Deep Learning boosted
Artificial Intelligence by accelerating the performance of hyper-parametric
function approximators, leading to surprising advances in image classification,
speech recognition, and language translation. Today, Deep Learning is
also showing promise for end-to-end learning of playing video games and
performing robotic manipulation tasks.

For robot perception, convolutional neural networks (CNNs), such as
VGG or ResNet, with three RGB color channels have become standard. For
robotics and computer vision tasks, it is common to borrow one of these
architectures (along with pre-trained weights) and then to perform transfer
learning or fine-tuning on task-specific data. But in some tasks, knowing
the colors in an image may provide only limited benefits. Consider training a
robot to grasp novel, previously unseen objects. It may be more important to
understand the geometry of the environment rather than colors and textures. The
physical process of manipulation — controlling one or more objects by applying
forces through contact — depends on object geometry, pose, and other factors
which are largely color-invariant. When you manipulate a pen with your hand, for
instance, you can often move it seamlessly without looking at the actual pen, so
long as you have a good understanding of the location and orientation of contact
points. Thus, before proceeding, one might ask: does it make sense to use
color images?

There is an alternative: depth images. These are single-channel grayscale
images that measure depth values from the camera, and give us invariance to the
colors of objects within an image. We can also use depth to “filter” points
beyond a certain distance, which can remove background noise, as we demonstrate
later with robot bed-making. Examples of paired depth and real images are shown
above.

In this post, we consider the potential for combining depth images and deep
learning in the context of three ongoing projects in the UC Berkeley
AUTOLab: Dex-Net for robot grasping, segmenting objects in heaps, and robot
bed-making.

Sensing Depth

Depth images encode distance (e.g., in millimeters) of surfaces in a scene
relative to a particular viewpoint. We provide an example in the image at the
top of this post. On the top left is an RGB image of a 3D cube structure, which
has points located at a variety of distances from the camera. To the top right
is one representation of a depth image, with darker points representing closer
surfaces, though it is also valid to use other representations, such as using
darker points for farther areas, or to use depth with respect to a different
origin. For additional background on how depth images can be created, check out
this blog post by the Comet Labs Research Team.
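
To make the two display conventions concrete, here is a minimal sketch of mapping a depth image in millimeters to 8-bit grayscale. The function name `depth_to_gray` and its `closer_is_darker` flag are illustrative assumptions, not part of any sensor API, and min-max normalization is only one of several reasonable choices.

```python
import numpy as np

def depth_to_gray(depth_mm, closer_is_darker=True):
    """Map a depth image (millimeters) to an 8-bit grayscale image.

    Hypothetical helper for visualization only; uses min-max scaling.
    """
    depth = np.asarray(depth_mm, dtype=np.float64)
    d_min, d_max = depth.min(), depth.max()
    if d_max == d_min:
        # A perfectly flat scene has no contrast to display.
        return np.zeros(depth.shape, dtype=np.uint8)
    # After scaling, 0 is the nearest surface and 1 is the farthest.
    scaled = (depth - d_min) / (d_max - d_min)
    if not closer_is_darker:
        # Opposite convention: darker pixels are farther away.
        scaled = 1.0 - scaled
    return np.round(scaled * 255).astype(np.uint8)

depth = np.array([[500, 1000], [1500, 2000]])
gray_near_dark = depth_to_gray(depth)
gray_far_dark = depth_to_gray(depth, closer_is_darker=False)
```

Both outputs carry the same geometric information; only the display convention differs.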

Recent Advancements in Depth Sensing

Recently, there have been a number of advancements in depth sensing which have
occurred in parallel with improvements in computer vision and deep learning.

Classically, depth sensing involved matching pairs of points between aligned RGB
images from two different cameras, and then using the resulting disparity map to
obtain the depth of objects in the environment.
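
As a rough illustration of the last step, the sketch below converts a disparity map to depth using the standard pinhole relation, depth = focal length × baseline / disparity. It assumes rectified cameras with parallel optical axes; the function name and the sample focal length and baseline are made-up values for the example.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a disparity map (pixels) to depth (meters).

    Assumes a rectified stereo pair: depth = focal_px * baseline_m / disparity.
    Pixels with non-positive disparity (no match found) are set to 0.
    """
    disparity = np.asarray(disparity_px, dtype=np.float64)
    depth = np.zeros_like(disparity)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical camera: 600-pixel focal length, 5 cm between the two cameras.
disparity = np.array([[60.0, 30.0], [0.0, 15.0]])
depth = disparity_to_depth(disparity, focal_px=600.0, baseline_m=0.05)
```

Because depth is inversely proportional to disparity, nearby surfaces produce large disparities and small matching errors matter most for distant surfaces.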

The depth sensors we commonly use today are structured light sensors, which
project a known pattern into the scene using a non-visible wavelength. The
Kinect innovation in particular was to project a known pattern from an infrared
(IR) projector and image that pattern with a single IR camera. Since light
travels in straight lines, a virtual IR camera placed at the projector would
always capture the same image of the pattern. Therefore, the image pattern from
the real IR camera can be matched against a pre-saved “template” image to find
correspondences. This can be done quickly on embedded hardware.

Another approach to depth sensing is LIDAR, an older technique which is
commonly used for surveying land and terrain, and has recently been applied to
some self-driving cars. LIDAR, while generally providing higher-quality depth
maps than Kinect, is slower and more expensive due to the need to scan lasers.

In sum, the Kinect is a consumer-grade RGB-D system that captures RGB images
along with per-pixel depth values directly with the hardware, and is faster and
cheaper (without sacrificing too much accuracy) than prior solutions. Many
robots available today for research and industrial purposes, such as the
Fetch Robot and the Toyota Human Support Robot, come equipped with
similar built-in depth sensing cameras. Future advancements in depth sensing for
robots may come from improvements in existing cameras such as Intel’s
, or from newer technologies introduced by companies such as

Prior Research Using Depth Images

The availability of depth sensing in robotics hardware has allowed depth images
to be used for real-time navigation (Maier et al., 2012), for real-time
mapping and tracking (Newcombe et al., 2011), and for modeling indoor
environments (Henry et al., 2012). Since depth allows robots to understand
how far they are from obstacles, it enables them to locate and avoid them during
navigation.

Depth images have additionally been used to detect, identify and localize body
parts of humans in real time (Plagemann*, Ganapathi*, et al., 2010) with high
reliability on real gaming systems (e.g., the Xbox One). Depth could remove or
mitigate sources of ambiguity, such as lighting and the wide variety of human
appearances and clothing. Other recent work uses simulated depth images to
develop closed-loop policies to guide a robot arm towards an object
(Viereck et al., 2017). In their case, the advantage of depth images was that
large datasets could be rapidly generated in simulation, and the depth images
were simulated relatively accurately using ray tracing.

These results suggest that for some tasks, depth images can encode a sufficient
amount of useful information and color invariance can be beneficial. We describe
three such cases below.

Example 1: Robot Grasping

Universal picking – grasping a large variety of previously unseen objects –
remains a Grand Challenge for robotics. Although many researchers (e.g., Pinto
and Gupta, 2016) use RGB images, their systems need many months of training
time with robots physically executing grasps. A key advantage of using 3D object
meshes is that one can synthesize accurate depth images via rendering
techniques, which use geometry and camera projection (Johns et al., 2016,
Viereck et al., 2017).

Our Dexterity Network (Dex-Net) is an ongoing research project in the AUTOLab
that encompasses algorithms, code, and datasets for training robot grasping
policies using a combination of large synthetic datasets, analytic robustness
models, stochastic sampling, and deep learning techniques. Dex-Net introduced
domain randomization in the context of grasping, focusing on grasping complex
objects with a simple gripper in contrast to recent work from OpenAI
showing the value of domain randomization for grasping simple objects with a
complex gripper. In a prior BAIR Blog post, we presented a dataset of 6.7
million samples, which was used to train a grasp quality model. Here, we
expand the discussion with a focus on depth images.

Dataset and Depth Images

The dataset generation process for Dex-Net. First, a large number of object mesh
models are generated and augmented from a variety of sources. For each model,
multiple parallel-jaw grasps are sampled. For each object and grasp
combination, we compute the robustness and generate a simulated depth image.
Robustness is computed by estimating the probability of grasp success over a
stochastic distribution on pose, friction, mass, and external forces (e.g.,
gravity direction) with Monte Carlo integration. To the right, we show samples
of positive and negative (success vs failure) grasp attempts, and show the
images that the network sees; the red grasp overlays are only for visualization
purposes. (Open in a new window to enlarge.)

We recently extended Dex-Net to automatically generate a modified synthetic
dataset of grasps on object meshes. Grasps are specified as the planar position,
angle, and depth of a gripper relative to an RGB-D sensor. We present an
overview of the data formation process in the figure above. Our overall goal is
to train a deep network that can detect whether a grasp attempt on some
(singulated) object, represented in a depth image, will succeed.
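
The robustness label described above is a Monte Carlo estimate of a success probability. The sketch below shows the general idea with toy stand-ins: `sample_perturbation` and `toy_grasp_succeeds` are hypothetical placeholders for the pose, friction, mass, and force distributions and the analytic grasp model, none of which are reproduced here.

```python
import random

def estimate_robustness(grasp_succeeds, sample_perturbation,
                        num_samples=1000, seed=0):
    """Monte Carlo estimate of the probability that a grasp succeeds.

    Draws random perturbations, evaluates a binary success test on each,
    and returns the fraction of successes.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(num_samples):
        perturbation = sample_perturbation(rng)
        if grasp_succeeds(perturbation):
            successes += 1
    return successes / num_samples

def sample_perturbation(rng):
    # Toy noise model (assumed): pose error in mm and a friction coefficient.
    return {
        "pose_offset_mm": rng.gauss(0.0, 2.0),
        "friction": rng.gauss(0.5, 0.1),
    }

def toy_grasp_succeeds(p):
    # Toy success rule (assumed): small pose error and enough friction.
    return abs(p["pose_offset_mm"]) < 3.0 and p["friction"] > 0.4

robustness = estimate_robustness(toy_grasp_succeeds, sample_perturbation)
```

More samples tighten the estimate at the cost of more simulation time per grasp.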

Training a GQ-CNN

The Grasp Quality CNN architecture. A grasp candidate image (shown to the left)
is processed and aligned based on the angle and center of the grasp, and a
corresponding 96×96 depth image (labeled “Aligned Image”) is passed as input,
along with the height $z$, to predict grasp robustness.

The simulated dataset is used to train a Grasp Quality Convolutional Neural
Network (GQ-CNN) to determine how likely a grasp attempt is to succeed. One can
use this GQ-CNN in a policy. For example, a policy could sample various grasps
and feed each through the GQ-CNN, pick the one with the highest grasp success
probability, and then execute its corresponding open-loop trajectory. For an
overview of our results, please see our prior BAIR Blog post.
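
The sample-and-rank policy described above can be sketched in a few lines. Everything here is illustrative: the `Grasp` fields follow the grasp parameterization given earlier (planar position, angle, and depth), while `toy_sample_grasp` and `toy_predict_success` are hypothetical stand-ins for a real grasp sampler and a trained GQ-CNN.

```python
import math
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Grasp:
    x: float      # planar position in the image (pixels)
    y: float
    angle: float  # gripper rotation in the image plane (radians)
    z: float      # gripper depth relative to the camera (meters)

def select_grasp(depth_image, sample_grasp, predict_success,
                 num_candidates=100, seed=0):
    """Sample candidate grasps, score each one, and return the best."""
    rng = random.Random(seed)
    candidates = [sample_grasp(rng) for _ in range(num_candidates)]
    scores = [predict_success(depth_image, g) for g in candidates]
    best = max(range(num_candidates), key=scores.__getitem__)
    return candidates[best], scores[best]

def toy_sample_grasp(rng):
    # Assumed sampler: uniform over a 96x96 image and a plausible depth range.
    return Grasp(rng.uniform(0, 96), rng.uniform(0, 96),
                 rng.uniform(0, math.pi), rng.uniform(0.5, 0.8))

def toy_predict_success(depth_image, grasp):
    # Placeholder for a trained network: prefers grasps near the image center.
    dist = math.hypot(grasp.x - 48, grasp.y - 48)
    return math.exp(-dist / 48.0)

best_grasp, best_score = select_grasp(None, toy_sample_grasp, toy_predict_success)
```

In a real system the chosen grasp would then be executed open-loop, and the scoring function would consume the aligned depth crop rather than ignore the image as this toy version does.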

In 2017, Dex-Net was extended to bin-picking, which involves iteratively
grasping objects from heaps. We modeled bin-picking as a Partially Observable
Markov Decision Process, and generated object heaps via simulation. Due to the
simulation, we were able to obtain full knowledge of the object poses, and used
an algorithmic supervisor to perform demonstrations of the task. We then
fine-tuned a GQ-CNN and performed imitation learning on the supervisor’s policy.
Using the resulting learned policy on a physical ABB YuMi robot, we were
able to clear heaps of 10 objects in under three minutes using only information
from the depth cameras.

Below, we show examples of real and simulated depth images which show grasps
from the Dex-Net system in a setup with multiple objects in a bin.

Top row: real depth images taken from the camera mounted over our ABB YuMi
robot. Bottom row: simulated depth images from Dex-Net. The red overlays
indicate the grasp attempt.

Example 2: Segmenting Objects in Bins

Instance segmentation is the task of determining which pixels in an image belong
to which object, while also separating instances of the same class. Instance
segmentation is widely used for robot perception; for example, as the initial
step in a robotic perception pipeline for grasping objects cluttered in a bin,
where the robot first segments the image to localize the target object to grasp
before executing a grasping policy.

Prior research in computer vision has demonstrated that Mask R-CNN can be
trained to segment objects in RGB images, but this training requires massive
hand-labeled datasets of real RGB images. In addition, images used for training
Mask R-CNN tend to represent natural scenes with limited numbers of object
classes. Thus, pretrained Mask R-CNN networks may not perform well on a task
such as segmenting arbitrary objects in a warehouse bin, and fine-tuning would
require knowledge and hand-labeled examples of each object. If we relax our
requirement that we predict each object’s class in addition to its mask, we can
predict masks for a larger set of object classes, and object geometries become
more influential than object identities.

Dataset and Depth Images

Our dataset formation process. Left: we sample 3D object models similar to those
used in Dex-Net. These are shuffled and dropped into an object heap, either
through simulation or through physical experiments. The corresponding depth
images are created, along with object masks for training and ground-truth

For geometry-based segmentation, we can use simulation and rendering techniques
to automate the process of collecting large and diverse training datasets of
labeled depth images, as shown in the figure above. We hypothesize that these
depth images may contain enough segmentation cues on their own, since depth
discontinuities are indicative of the “pixel borders” of objects. Our simulated
dataset of 50K depth images was generated by sampling several 3D objects out of
1600 models, and dropping them into a bin via PyBullet simulation. Since the
object models are known, we can automatically generate accurate depth images
along with ground-truth masks. Using this dataset, we trained a version of Mask
R-CNN, which we call SD Mask R-CNN, only on synthetic depth data.
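
To illustrate why depth discontinuities hint at object borders, the sketch below marks pixels where depth jumps sharply between neighbors. This is not part of SD Mask R-CNN, which learns its own features; the function name and the 2 cm threshold are assumptions chosen for the toy scene.

```python
import numpy as np

def depth_discontinuities(depth_m, jump_m=0.02):
    """Mark pixels adjacent to a depth jump larger than jump_m (meters).

    Compares each pixel with its vertical and horizontal neighbors and
    flags both pixels on either side of a large jump.
    """
    depth = np.asarray(depth_m, dtype=np.float64)
    edges = np.zeros(depth.shape, dtype=bool)
    jump_rows = np.abs(np.diff(depth, axis=0)) > jump_m  # shape (H-1, W)
    jump_cols = np.abs(np.diff(depth, axis=1)) > jump_m  # shape (H, W-1)
    edges[:-1, :] |= jump_rows
    edges[1:, :] |= jump_rows
    edges[:, :-1] |= jump_cols
    edges[:, 1:] |= jump_cols
    return edges

# Toy scene: a flat bin floor 1.0 m away with a 10 cm tall box on it.
scene = np.full((6, 6), 1.0)
scene[2:4, 2:4] = 0.9
edges = depth_discontinuities(scene)
```

A fixed threshold like this breaks down when objects touch or overlap at similar heights, which is one reason to learn the segmentation instead.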

Segmentation Results on Real Images

The results suggest that our SD Mask R-CNN can accurately segment despite not
being trained on any real images. We show an example bin picking setup, the
depth image, the ground truth segmentation, two baselines, and our method. The
two rows represent the same bin-picking setup, but with two sensors: high
resolution (top) and low resolution (bottom). (Open in a new window to enlarge.)

Our proposed SD Mask R-CNN outperforms point cloud segmentation and fine-tuned
Mask R-CNN on a dataset of real images despite not being trained on any real
images. An example of segmentation results and other related images are shown
above. Importantly, the objects used in creating the hand-labeled dataset of
real images were not chosen from the training distribution of SD Mask R-CNN; in
fact, they were common household items for which we do not have 3D models. Thus,
SD Mask R-CNN can predict masks for previously unseen objects. Moreover, we find
that we can reduce the size of the backbone network of Mask R-CNN (e.g., from
ResNet-101 to ResNet-35) for depth images as compared to training with color
images.

Segmenting objects as “object” or “background” allows for decoupling of the
classification and segmentation stages; we found that for a set of ten objects,
we could train a VGG classifier in less than ten minutes that could achieve over
95% classification accuracy. These results suggest that SD Mask R-CNN could be
used in tandem with a classification network, which could easily be retrained
for each set of objects used.

Overall, our segmentation results suggest three main benefits of using depth
over RGB images:

  1. depth information may encode the geometric cues necessary to separate object
    instances both from each other and the background of the image,
  2. synthetic depth images can be easily and rapidly generated, and training on
    them can effectively transfer to real images,
  3. a network trained using depth images can potentially generalize better to
    previously unseen objects, as geometric cues can be more consistent across

Example 3: Robot Bed-Making

Bed-making is a task we believe could be well-suited for home robotics since it
is tolerant to error, not time critical, and rarely enjoyed by humans. We
introduced the bed-making task in an earlier blog post and explored it with
RGB images as a sequential decision problem with noise injection applied for
better imitation learning. In our recent preprint, we used depth sensing
to extend this project to explore transfer between blankets of different colors
and textures and between robots.

Dataset and Depth Images

Examples of initial states for the bed-making task. The first four columns show
samples in the training data. The last two demonstrate the Cal and Teal blankets
that we used to test generalization to different blankets. (Open in a new window
to enlarge.)

We framed the bed-making task as one of detecting corners of a blanket, so that
a mobile home robot, such as the Fetch or the HSR, can grasp and pull the blanket
to a corner of the bed frame to maximize blanket coverage. Our starting
hypothesis was that depth images contained enough information about the geometry
of blanket corners to allow for reliable bed-making.

To collect training data, we use white blankets with marked red corners, as
shown in the above image, so that we can automatically detect a corner and thus
a grasping target. We repeatedly toss blankets on the bed surface and collect
RGB and depth images from the robot’s onboard RGB-D sensors.
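
One simple way to turn a marked red corner into a label is color thresholding on the RGB image followed by a centroid. The sketch below is an assumption about how such labeling could work, not a description of our pipeline; the function name and threshold values are hypothetical.

```python
import numpy as np

def find_red_marker(rgb, min_red=150, max_other=100):
    """Return the (row, col) centroid of strongly red pixels, or None.

    A pixel counts as red when its red channel is high and its green and
    blue channels are both low. Thresholds are illustrative guesses.
    """
    rgb = np.asarray(rgb).astype(int)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mask = (r >= min_red) & (g <= max_other) & (b <= max_other)
    if not mask.any():
        return None
    rows, cols = np.nonzero(mask)
    return float(rows.mean()), float(cols.mean())

# Toy image: a white blanket with a 3x3 red patch marking one corner.
image = np.full((10, 10, 3), 255, dtype=np.uint8)
image[2:5, 6:9] = (200, 20, 20)
corner = find_red_marker(image)
```

The detected pixel location in the RGB image can then serve as the training target for the paired depth image.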

We next train a deep convolutional neural network to detect corners from depth
images only, with the hope that the network will generalize to detecting
corners from depth images of different blankets. Our deep network utilizes
pre-trained weights from YOLO, a fast real-time object detector, because the
task of finding a grasping point is similar to detection. We then add several
layers after this, which we train with our dataset of 2018 depth images (and
yes, 2018 is just a coincidence). Our results indicate that using pre-trained
weights is beneficial despite the depth versus RGB mismatch; the pre-trained
weights from YOLO were obtained by training on RGB images.

Another advantage of depth images is that they let us remove sources of
distraction. For example, we want the robots to grasp blanket corners. These are
not located in areas far beyond the top surface of the bed. Thus, we can “black
out” regions beyond a validation-tuned depth value (we used 1.4 meters as the
cutoff) before scaling pixels within $[0,255]$. We provide a simple script
that one can use for processing depth images.
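
For readers who want the idea without the script, here is a minimal sketch of the cutoff-then-scale step. It is an approximation under stated assumptions rather than the script itself: the function name is hypothetical, missing readings are treated as zero, and the remaining depths are scaled linearly by the cutoff.

```python
import numpy as np

def preprocess_depth(depth_m, cutoff_m=1.4):
    """Black out pixels beyond cutoff_m, then scale to 8-bit [0, 255]."""
    depth = np.asarray(depth_m, dtype=np.float64).copy()
    # Treat missing sensor readings (NaN or inf) as "no data".
    depth[~np.isfinite(depth)] = 0.0
    # Remove everything farther than the cutoff, such as the floor or walls.
    depth[depth > cutoff_m] = 0.0
    # Remaining values lie in [0, cutoff_m]; map them linearly to [0, 255].
    scaled = depth / cutoff_m * 255.0
    return np.round(scaled).astype(np.uint8)

raw = np.array([[0.35, 1.4], [2.0, np.nan]])
processed = preprocess_depth(raw)
```

The 1.4 meter default mirrors the cutoff mentioned above; a different camera height or bed would need its own value.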

Corner Detection Results

Visualization of a bed-making rollout with an emphasis on corner detection. In
the first row, the robot’s trained grasp policy correctly identifies the corner
(top left) and the resulting situation after the grasp and pull is in the top
right. On the other side, the policy again detects the corner well (bottom left)
and the resulting grasp and pull is shown in the bottom right.

We deployed our trained grasping policy and found that in terms of blanket
coverage, it significantly outperformed a non-learning baseline policy, and was
nearly as good as a human supervisor. While our metric here is blanket coverage
rather than detecting corners, accurate detection is strongly correlated with
higher coverage.

In the above image, we show the corner predictions on a teal blanket with a
red crosshair. The grasping network was not trained on teal blankets, and only
saw the depth images, but nonetheless is able to detect corners accurately since
the test-time depth images look similar to depth images from training. After the
robot moves to the other side of the bed to attempt another grasp, it again does
an excellent job in detecting the nearest corner. We tried using RGB-trained
grasping policies, but these did not perform well since the original RGB-trained
policy was trained only on white blankets, and we would need far more blankets
and training data to generalize across blanket colors.

Depth Matters

Our results in these projects suggest that depth maps contain sufficient clues
for the tasks of determining grasp points, segmenting images, and detecting
corners of deformable objects. We conjecture that, as the quality of depth
cameras improves in tandem with reduction in costs, depth images will be an
increasingly important modality for robotics. It is far easier to synthesize
training examples with depth images, color-invariance results naturally, and
background noise can be easily filtered (as we demonstrate in robot bed-making).
Depth images are lower dimensional than RGB (one vs three 8-bit channels) and
CNNs appear to learn filters for edges and spatial patterns in both.

Paper References

We encourage readers to check out the following papers and project websites for
more details.

Additional papers and projects can be found at the AUTOLab website.