Using Stereo Matching with General Epipolar Geometry for 2D Face Recognition across Pose

Carlos D. Castillo, Student Member, IEEE, and David W. Jacobs, Member, IEEE

Abstract—Face recognition across pose is a problem of fundamental importance in computer vision. We propose to address this problem by using stereo matching to judge the similarity of two 2D images of faces seen from different poses. Stereo matching allows for arbitrary, physically valid, continuous correspondences. We show that the stereo matching cost provides a very robust measure of similarity of faces that is insensitive to pose variations. To enable this, we show that, for conditions common in face recognition, the epipolar geometry of face images can be computed using either four or three feature points. We also provide a straightforward adaptation of a stereo matching algorithm to compute the similarity between faces. The proposed approach has been tested on the CMU PIE data set and demonstrates superior performance compared to existing methods in the presence of pose variation. It also shows robustness to lighting variation.

Index Terms—Face recognition, pose, stereo matching, epipolar geometry.

1 INTRODUCTION

Face recognition is a fundamental problem in computer vision. There has been substantial progress in the case of images taken under constant pose [30]. There are also several approaches to handling pose variation [24], [15], [17], [8]. However, there is still considerable room for improvement. Progress would be important in many applications, such as surveillance, security, and the analysis of personal photos, and in other domains in which we cannot control the position of subjects relative to the camera.

Correspondence seems crucial to producing meaningful image comparisons. The importance of good correspondences is even greater in the case of face recognition across pose.
Standard systems often align the eyes or a few other features, using translations, similarity transformations, or perhaps affine transformations. However, when the pose varies, these can still result in fairly significant misalignments in other parts of the face. Observe, for example, Fig. 1. To handle this situation, we use stereo matching. This allows for arbitrary, one-to-one continuous transformations between images, along with possible occlusions, while maintaining an epipolar constraint. In the process of computing the correspondences between scan lines in two images, a stereo matching cost is optimized, which reflects how well the two images match. We show that the stereo matching cost is robust to pose variations. Consequently, we can use the stereo matching cost as a measure of similarity between two face images.

Note that we are not interested in performing 3D reconstruction, which is the most common purpose of stereo matching. In reconstruction, the stereo matching costs are discarded and the correspondences are used, along with geometric information about the camera layout, to compute a 3D model of the world. We have no use for the correspondences except to compute the stereo matching costs. We are therefore unaffected by some of the difficulties that make it hard to avoid artifacts in stereo reconstruction. For example, ambiguities frequently arise when different correspondences produce similar costs; in this case, selecting the correct correspondence is essential for reconstruction, but not very important for judging the similarity of two images.

Prior to stereo matching, we need to estimate the epipolar geometry. In almost all applications of face recognition, the size of the face is small relative to its distance from the camera. Therefore, we can approximate the projection of the face onto the camera using scaled orthographic projection (weak perspective). We can then use four feature points to estimate the epipolar geometry of the two faces.
The images are then rectified, and the similarity score is computed by summing the stereo matching cost of every row of the rectified images. We also study a specific case in which the camera is at the same height as the eyes of an upright subject. In this case, the epipolar lines are parallel to the line that connects the two eyes, and we can determine the epipolar geometry using only three points. We also tried obtaining the epipolar geometry from each pair of images using the method of Domke and Aloimonos [11], [12]. In this case, our method requires no hand-clicked points. We verified that there is no decrease in recognition performance in a fully automatic system.

Putting these steps together, we have the following simple algorithm:

. Prior to recognition, build a gallery of 2D images of faces, each with three to four landmark points specified.
. Given a 2D probe image, find three to four corresponding landmark points.
. Compare the probe to each gallery image as follows:
  - Using the landmark points, rectify the probe and gallery image.
  - Run a stereo algorithm on the image pair, using the enhancements described in Section 4. Discard the correspondences and use the matching cost as a measure of image similarity.
. Identify the probe with the gallery image that produces the lowest matching cost.

We will show that this method works very well even for large viewpoint changes. We evaluate our method using the CMU PIE data set and the Labeled Faces in the Wild (LFW) data set. Our results show that, with pose variation at constant illumination, our method is more accurate than the previous methods of Gross et al. [17], Chai et al. [8], and Romdhani et al. [24]. While our method is designed to handle only pose variation, we also test it with combined pose and illumination variation to verify that it does not break down in such a setting. Surprisingly, our method is more accurate than the method of Gross et al.
[15], which is designed to handle lighting variation, though it is not as accurate as the method of Romdhani et al. [24]. The experiments on the LFW data set show reasonable performance in an unconstrained setting (where there is simultaneous variation in pose, illumination, and expression).

This is an extended version of our conference paper [7]. The original conference version does not include our method with four feature points or experiments using four feature points, includes only limited experiments with lighting change, and does not include the results on the LFW data set. Additionally, in the conference paper we did not develop a fully automatic system. However, the conference version of our paper includes an analysis of stereo matching for face recognition that has been eliminated from this version due to space constraints.

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 discusses issues related to image alignment

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. XX, XXXXXXX 2010

. The authors are with the Department of Computer Science, University of Maryland, 4420 A.V. Williams Bldg., College Park, MD 20742. E-mail: {carlos, djacobs}@cs.umd.edu.

Manuscript received 28 Aug. 2008; revised 1 Apr. 2009; accepted 29 Apr. 2009; published online 21 May 2009. Recommended for acceptance by M.-H. Yang. For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-2008-08-0575. Digital Object Identifier no. 10.1109/TPAMI.2009.123.

0162-8828/10/$26.00 © 2010 IEEE Published by the IEEE Computer Society
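To make the use of the matching cost as a similarity measure concrete, the following sketch sums a per-scanline dynamic-programming cost over rectified rows and identifies the probe with the lowest-cost gallery image. This is our own minimal illustration, not the authors' stereo algorithm: the absolute-difference match cost and the occlusion penalty value are illustrative assumptions, and all function names are ours.

```python
import numpy as np

def scanline_cost(row1, row2, occlusion_penalty=1.0):
    """Dynamic-programming cost of matching two rectified scanlines.

    Pixels may be matched in left-to-right order (cost = absolute
    intensity difference) or left occluded (fixed penalty).  Only the
    minimal total cost is returned; the correspondences themselves are
    discarded, since recognition uses only the cost.
    """
    n, m = len(row1), len(row2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # match pixel i-1 with pixel j-1
                D[i, j] = min(D[i, j],
                              D[i - 1, j - 1] + abs(row1[i - 1] - row2[j - 1]))
            if i > 0:            # pixel in row1 occluded
                D[i, j] = min(D[i, j], D[i - 1, j] + occlusion_penalty)
            if j > 0:            # pixel in row2 occluded
                D[i, j] = min(D[i, j], D[i, j - 1] + occlusion_penalty)
    return D[n, m]

def face_similarity(img1, img2):
    """Sum the stereo matching cost over corresponding rectified rows."""
    return sum(scanline_cost(r1, r2) for r1, r2 in zip(img1, img2))

def identify(probe, gallery):
    """Return the index of the gallery image with the lowest cost."""
    return min(range(len(gallery)),
               key=lambda k: face_similarity(probe, gallery[k]))
```

The point of the dynamic program is that a horizontal shift between two views of the same face, as produced by a pose change after rectification, is absorbed by a few cheap occlusions rather than inflating the cost of every pixel, so the correct gallery image still attains the minimum.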