Int J Comput Vis (2015) 111:69–97
DOI 10.1007/s11263-014-0734-4

3DNN: 3D Nearest Neighbor. Data-Driven Geometric Scene Understanding Using 3D Models

Scott Satkin · Maheen Rashid · Jason Lin · Martial Hebert

Received: 11 November 2013 / Accepted: 23 May 2014 / Published online: 22 July 2014
© Springer Science+Business Media New York 2014

Abstract  In this paper, we describe a data-driven approach to leverage repositories of 3D models for scene understanding. Our ability to relate what we see in an image to a large collection of 3D models allows us to transfer information from these models, creating a rich understanding of the scene. We develop a framework for auto-calibrating a camera, rendering 3D models from the viewpoint an image was taken, and computing a similarity measure between each 3D model and an input image. We demonstrate this data-driven approach in the context of geometry estimation and show the ability to find the identities, poses and styles of objects in a scene. The true benefit of 3DNN compared to a traditional 2D nearest-neighbor approach is that by generalizing across viewpoints, we free ourselves from the need to have training examples captured from all possible viewpoints. Thus, we are able to achieve comparable results using orders of magnitude less data, and recognize objects from never-before-seen viewpoints. In this work, we describe the 3DNN algorithm and rigorously evaluate its performance for the tasks of geometry estimation and object detection/segmentation, as well as two novel applications: affordance estimation and photorealistic object insertion.

Communicated by Cordelia Schmid.

S. Satkin (B)
Google Inc., Mountain View, CA, USA
e-mail: satkin@google.com

M. Rashid · M. Hebert
Carnegie Mellon University, Pittsburgh, PA, USA
e-mail: maheenr@andrew.cmu.edu

M. Hebert
e-mail: hebert@ri.cmu.edu

J. Lin
Microsoft Corp., Redmond, WA, USA
e-mail: jasonlin@alumni.cmu.edu
Keywords  Computer vision · Machine learning · Scene understanding · Geometry estimation · 3D data

1 Introduction

This work explores the intersection of geometric reasoning and machine learning for scene understanding. Our objective is to produce a rich representation of the world from a single image by relating what we see in the image with vast repositories of 3D models, as shown in Fig. 1. By matching and aligning an image with 3D data, we can produce detailed reconstructions of scenes and transfer rich information from the models to answer a wide variety of queries. Our work builds upon recent advances in data-driven scene matching and single-view geometry estimation, which we now summarize.

1.1 Data-Driven Approaches in Computer Vision

Over the past decade, researchers have demonstrated the effectiveness of data-driven approaches for complex computer vision tasks. Large datasets such as Torralba et al. (2008)'s 80 Million Tiny Images and Deng et al. (2009)'s ImageNet have proven to be invaluable sources of information for tasks like scene recognition and object classification. Simple nearest-neighbor approaches for matching an input image (or patches of an image) with a large corpus of annotated images enable the "transfer" of information from one image to another. These non-parametric approaches have been shown to achieve amazing performance for a wide variety of complex computer vision and graphics tasks ranging
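The nearest-neighbor transfer idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the feature vectors, corpus, and labels below are hypothetical placeholders for whatever image descriptor and annotated dataset a real system would use. The query is matched to its closest corpus entry by Euclidean distance, and that entry's annotation is "transferred".

```python
import numpy as np

def nearest_neighbor_transfer(query_feat, corpus_feats, corpus_labels):
    """Return the annotation of the corpus image whose feature
    vector is closest (Euclidean distance) to the query."""
    dists = np.linalg.norm(corpus_feats - query_feat, axis=1)
    best = int(np.argmin(dists))
    return corpus_labels[best], float(dists[best])

# Toy corpus: three "images" described by 4-D feature vectors,
# each with a scene-category annotation to transfer.
corpus = np.array([[0.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0],
                   [0.9, 1.1, 1.0, 0.8]])
labels = ["street", "bedroom", "bedroom"]

query = np.array([1.0, 0.9, 1.1, 0.9])
label, dist = nearest_neighbor_transfer(query, corpus, labels)
print(label)  # annotation transferred from the closest match: "bedroom"
```

In practice the corpus holds millions of images and the descriptor is far richer (e.g., GIST or learned features), but the transfer mechanism is exactly this: find the nearest annotated example and borrow its information.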