Learning to Open New Doors

Ellen Klingbeil, Ashutosh Saxena, Andrew Y. Ng
Computer Science Department, Stanford University, Stanford, CA 94305
{ellenrk,asaxena,ang}@cs.stanford.edu

Abstract

As robots enter novel, uncertain home and office environments, they are able to navigate these environments successfully. However, to be practically deployed, robots should also be able to manipulate their environment to gain access to new spaces, such as by opening a door or operating an elevator. This remains a challenging problem because a robot will encounter doors it has never seen before. Objects such as door handles and elevator buttons, though very different in appearance, are functionally similar; thus, they share some common features in the way they can be perceived and acted upon. We present a vision-based learning algorithm that captures these features to: (a) find where the door handle is located, and (b) infer how to manipulate it to open the door. Our system assumes no prior knowledge of the 3-D location or shape of the door handle. We also experimentally verify our algorithms on doors not seen in the training set, advancing toward the goal of enabling a robot to navigate anywhere in a new building by opening doors and operating elevators, even ones it has not seen before.

Introduction

There has been recent interest in using robots not only in controlled factory environments but also in unstructured home and office environments. In the past, successful navigation algorithms have been developed for robots in these environments; but to be practically deployed, robots must also be able to manipulate their environment to gain access to new spaces, such as by opening a door or operating an elevator. This remains a challenging problem because a robot will likely encounter doors and elevators it has never seen before.
In robotic manipulation, most work has focused on developing control actions for different tasks, such as grasping objects (Bicchi & Kumar 2000), assuming a detailed 3-D model of the environment is known. There has been some recent work in opening doors using manipulators (Rhee et al. 2004; Petersson, Austin, & Kragic 2000; Kim et al. 2004; Prats, Sanz, & del Pobil 2007); however, that work focused on developing control actions assuming a known location of a known door handle. (Petrovskaya & Ng 2007) assumed a known, detailed model of the door and door handle to be opened. In practice, a robot has to rely on only its sensors to perform manipulation in a new environment, and current sensor technology does not have enough resolution to build a model of the object detailed enough for manipulation purposes.

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Most work in computer vision has focused on object recognition, e.g., (Serre, Wolf, & Poggio 2005). However, for manipulation purposes, a robot not only needs to locate the object, but also needs to determine what to do with it. For example, if the intention of the robot is to enter a door, it needs to find where the door handle is as well as determine what action it must take in that situation (for example, turn the door handle to the right and push).

Our work does not assume the existence of a known model of the object (such as the door, the door handle, or the elevator button). Instead, we focus on the problem of manipulation in novel environments, in which a model of the objects is not available. We also demonstrate the robustness of our algorithms through extensive experiments in which the robot was able to reliably open new doors in new buildings, even ones seen for the first time by the robot (and by the researchers working on the algorithm).
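As a minimal, hypothetical sketch (not the authors' code) of one step in the perception pipeline described in the Algorithm Overview that follows: when a classifier fires on many nearby pixels for each handle in an image, K-means clustering recovers one center per handle. The synthetic "detections," blob positions, and function names below are illustrative assumptions.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain K-means on (n, 2) pixel detections; returns (k, 2) centers."""
    # Deterministic init: spread initial centers across the point list.
    # Assumes no cluster goes empty (true for well-separated detections).
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)]
    for _ in range(iters):
        # Assign each detection to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned detections.
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return centers

# Two synthetic blobs of "handle" pixel detections (illustrative data).
rng = np.random.default_rng(0)
blob_a = rng.normal([120.0, 200.0], 4.0, size=(60, 2))
blob_b = rng.normal([420.0, 210.0], 4.0, size=(60, 2))
pts = np.vstack([blob_a, blob_b])

centers = kmeans(pts, k=2)
# The two centers land near (120, 200) and (420, 210), one per handle.
print(np.round(centers[np.argsort(centers[:, 0])], 1))
```

Each recovered center would then be mapped to a 3-D target for the manipulator, e.g. by intersecting the camera ray through that pixel with the wall plane from a laser scan, as the overview below describes.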
Algorithm Overview

Our perception system consists of two parts: (a) finding the object, and (b) inferring how to manipulate it.

For finding the object, we compute features that were motivated in part by some recent work in object recognition (Serre, Wolf, & Poggio 2005) and robotic grasping (Saxena et al. 2006). We use the Support Vector Machine (SVM) (Vapnik 1995) learning algorithm and select the most relevant directions using Principal Component Analysis. We also take advantage of some contextual information to learn a location-based prior (partly motivated by (Torralba 2003)). This captures properties such as the fact that a door handle is unlikely to be found close to the floor. To deal with multiple handles/buttons in an image and the spatial correlation between their predicted locations (see Figure 1), we use a K-means clustering algorithm to return the center of each handle. We estimate the 3-D location of the handle/button from its 2-D location in the camera frame and from a horizontal laser scan, by assuming that the walls are vertical.

Given a rectangular image patch containing an object, we then need to classify what action to take. We consider three types of abstract actions: turn left, turn right, and press. To distinguish between these actions, we use a similar classifier (as described above) and achieve an overall classification accuracy of 94.1%.

With the 3-D location and desired abstract action type known, we now define each abstract action as a set of key-