Int J Comput Vis (2015) 111:69–97
DOI 10.1007/s11263-014-0734-4
3DNN: 3D Nearest Neighbor
Data-Driven Geometric Scene Understanding Using 3D Models
Scott Satkin · Maheen Rashid · Jason Lin · Martial Hebert
Received: 11 November 2013 / Accepted: 23 May 2014 / Published online: 22 July 2014
© Springer Science+Business Media New York 2014
Abstract In this paper, we describe a data-driven approach
to leverage repositories of 3D models for scene understand-
ing. Our ability to relate what we see in an image to a
large collection of 3D models allows us to transfer informa-
tion from these models, creating a rich understanding of the
scene. We develop a framework for auto-calibrating a cam-
era, rendering 3D models from the viewpoint from which an
image was taken, and computing a similarity measure between each 3D
model and an input image. We demonstrate this data-driven
approach in the context of geometry estimation and show the
ability to find the identities, poses and styles of objects in a
scene. The true benefit of 3DNN compared to a traditional
2D nearest-neighbor approach is that by generalizing across
viewpoints, we free ourselves from the need to have training
examples captured from all possible viewpoints. Thus, we
are able to achieve comparable results using orders of mag-
nitude less data, and recognize objects from never-before-
seen viewpoints. In this work, we describe the 3DNN algo-
rithm and rigorously evaluate its performance for the tasks
of geometry estimation and object detection/segmentation,
as well as two novel applications: affordance estimation and
photorealistic object insertion.

Communicated by Cordelia Schmid.

S. Satkin (✉)
Google Inc., Mountain View, CA, USA
e-mail: satkin@google.com

M. Rashid · M. Hebert
Carnegie Mellon University, Pittsburgh, PA, USA
e-mail: maheenr@andrew.cmu.edu

M. Hebert
e-mail: hebert@ri.cmu.edu

J. Lin
Microsoft Corp., Redmond, WA, USA
e-mail: jasonlin@alumni.cmu.edu
Keywords Computer vision · Machine learning · Scene
understanding · Geometry estimation · 3D data
1 Introduction
This work explores the intersection of geometric reasoning
and machine learning for scene understanding. Our objec-
tive is to produce a rich representation of the world from a
single image by relating what we see in the image with vast
repositories of 3D models, as shown in Fig. 1. By matching
and aligning an image with 3D data, we can produce detailed
reconstructions of scenes and transfer rich information from
the models to answer a wide variety of queries. Our work
builds upon recent advances in data-driven scene matching
and single-view geometry estimation, which we now sum-
marize.
1.1 Data-Driven Approaches in Computer Vision
Over the past decade, researchers have demonstrated the
effectiveness of data-driven approaches for complex com-
puter vision tasks. Large datasets such as Torralba et al.
(2008)’s 80 Million Tiny Images and Deng et al. (2009)’s
ImageNet have proven to be invaluable sources of informa-
tion for tasks like scene recognition and object classifica-
tion. Simple nearest-neighbor approaches that match an
input image (or patches of an image) against a large corpus of
annotated images enable the "transfer" of information from
one image to another. These non-parametric approaches have
been shown to achieve impressive performance on a wide vari-
ety of complex computer vision and graphics tasks ranging
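The nearest-neighbor "transfer" idea described above can be sketched in a few lines: represent each image by a feature descriptor, find the annotated corpus image whose descriptor is closest to the query's, and copy over its annotation. The sketch below is purely illustrative and is not the descriptor or matching pipeline used in the cited works; the function name, the toy 4-D descriptors, and the labels are all hypothetical, and Euclidean distance stands in for whatever metric a real system would use.

```python
import numpy as np

def nearest_neighbor_transfer(query_desc, corpus_descs, corpus_labels):
    """Transfer the annotation of the corpus image whose descriptor
    is closest (Euclidean distance) to the query descriptor."""
    dists = np.linalg.norm(corpus_descs - query_desc, axis=1)
    best = int(np.argmin(dists))
    return corpus_labels[best], float(dists[best])

# Toy corpus: three "images" summarized by hypothetical 4-D descriptors,
# each carrying a scene-category annotation.
corpus = np.array([[0.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0],
                   [0.9, 1.1, 1.0, 0.8]])
labels = ["beach", "kitchen", "kitchen"]

query = np.array([1.0, 0.9, 1.1, 0.9])
label, dist = nearest_neighbor_transfer(query, corpus, labels)
print(label)  # the nearest corpus image's annotation is "transferred"
```

In a 2D pipeline the corpus must densely cover appearance variation, including viewpoint; the 3DNN argument in the abstract is that matching against rendered 3D models removes the viewpoint axis from that coverage requirement.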