Comparing Strategies for 3D Face Recognition from a 3D Sensor*

Jongmoo Choi, Ayush Sharma, and Gérard Medioni¹

Abstract—We address the problem of 3D face recognition from 3D data, using different strategies. One strategy (1F-NF), explored earlier, is to match each individual frame to a set of reference frames. A second one (1F-3D) is to replace the set of reference frames by a 3D model resulting from the integration of individual frames. A third strategy (3D-3D) is to use a 3D face model inferred from multiple frames as the input probe. We show that 3D model to 3D model matching outperforms the other strategies, at the cost of a delay in response due to the model building step.

I. INTRODUCTION

Face recognition has been an active research topic for several decades, and various techniques have been presented [1][2][3][4]. Traditional 2D image-based face recognition methods appear to be sensitive to variations in pose, illumination, and expression [1]. Since pixel intensity is a non-linear combination of geometry, viewpoint, lighting, and surface properties, capturing invariant features from projected images is a difficult problem.

Many researchers have presented 3D face recognition methods, as shape information is independent of viewpoint and lighting changes [25]. Most existing methods use laser scanning [1], stereo vision [2], structure from motion [3][5], or a generic face model [8][6][7] to obtain 3D face models. A laser scanner is slow and expensive. Approaches based on multiple images are unstable and computationally costly. The recent success of low-cost depth cameras [11], such as the PrimeSense camera [27][28], makes it possible to process RGB and depth video streams for 3D face recognition.
We have previously presented a real-time 3D face identification system using a low-cost depth camera, in which both the input probe and the set of gallery data are registered with a small number of reference faces in order to reduce computational complexity while preserving the recognition rate [9]. We have also presented an accurate 3D face modeling technique that produces a laser-scan-quality 3D face model from a noisy depth video stream by aggregating registered 3D data into a 2D unwrapped cylindrical coordinate system [10]. Clearly, the performance of a 3D recognition system depends on the quality of the input data [25], and accurate 3D face models should enable us to improve the recognition rate. However, the best strategy for combining multiple input frames is not obvious.

Fig. 1. Three strategies for 3D face recognition: 1F-NF (single vs. N frames), 1F-3D (single vs. 3D), and 3D-3D (3D face vs. 3D face). We use 1F-1F (single vs. single frame) as a baseline performance.

*This work was partly supported by the IT R&D program of MKE & KEIT [10041610, The development of the recognition technology for user identity, behavior and location that has a performance approaching recognition rates of 99% on 30 people by using perception sensor network in the real environment].

¹ J. Choi, A. Sharma, and G. Medioni are with the Institute for Robotics and Intelligent Systems, Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90089-0273, USA {jongmooc, ayushsha, medioni} at usc.edu

Of course, one can argue that using more depth data provides better performance because it carries more information; our previous work [9] also shows this. In contrast, we might lose some information during the modeling process, because we use a 2D unwrapped cylindrical coordinate system that can represent only star-shaped objects [10].
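To make the star-shape restriction concrete, the following is a minimal sketch of a cylindrical unwrapping, not the modeling pipeline of [10]. The function name `unwrap_to_cylinder`, the grid resolution, the choice of the y-axis as the cylinder axis, and the per-cell averaging of radii are all assumptions made for illustration. Each 3D point is mapped to a (height, angle) cell that stores a single radius, so two surface points lying on the same ray from the axis cannot both be represented.

```python
import math


def unwrap_to_cylinder(points, n_theta=8, n_y=4, y_min=0.0, y_max=1.0):
    """Average the radii of 3D points into a (height, angle) grid.

    Hypothetical helper: the cylinder axis is assumed to be the y-axis,
    and each grid cell keeps one value, the mean distance to that axis.
    """
    sums = [[0.0] * n_theta for _ in range(n_y)]
    counts = [[0] * n_theta for _ in range(n_y)]
    for x, y, z in points:
        if not (y_min <= y < y_max):
            continue  # ignore points outside the modeled height range
        theta = math.atan2(x, z) % (2 * math.pi)  # angle around the axis
        col = min(int(theta / (2 * math.pi) * n_theta), n_theta - 1)
        row = int((y - y_min) / (y_max - y_min) * n_y)
        sums[row][col] += math.hypot(x, z)  # radius = distance to the y-axis
        counts[row][col] += 1
    # Cells that received no observation are left empty (None).
    return [
        [s / c if c else None for s, c in zip(sum_row, count_row)]
        for sum_row, count_row in zip(sums, counts)
    ]


# Two points on the same ray from the axis fall into the same cell.
grid = unwrap_to_cylinder([(0.0, 0.1, 1.0), (0.0, 0.1, 2.0), (1.0, 0.6, 0.0)])
print(grid[0][0], grid[2][2], grid[1][1])
```

In this toy input the first two points share one cell, so their radii (1.0 and 2.0) are merged into a single value. This is the kind of information loss that a one-radius-per-cell map implies for shapes that are not star-shaped with respect to the axis.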
Hence, one important question is whether a system using a reconstructed 3D face model performs better than a system using the raw depth frames from which the model was reconstructed. Our hypothesis is that modeling should provide better results, since aggregating multiple observations improves the signal-to-noise ratio up to a certain point. To answer this question, we need to compare a method using multiple frames (1F-NF) with a method using a single 3D face model generated from the same frames (1F-3D).

In many practical applications, including human-robot interaction [23], it is also possible to acquire multiple frames from a user. In this case, we can use all the raw depth images as the probe set (NF-1F), or we can build an accurate 3D face model for the probe (3D-1F or 3D-3D). Because the matching process is symmetric, the performance of the NF-1F strategy is given by the result of 1F-NF.
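The strategies can be contrasted with a toy sketch. It makes strong simplifying assumptions that are not taken from the paper: frames are short lists of already registered depth samples, matching is a root-mean-square distance, 1F-NF takes the nearest reference frame, and "model building" is replaced by a per-sample mean. All function names are hypothetical.

```python
import math


def rms_distance(a, b):
    """Root-mean-square difference between two equally sized frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def build_model(frames):
    """Stand-in for 3D model building: per-sample mean of registered frames."""
    n = len(frames)
    return [sum(samples) / n for samples in zip(*frames)]


def match_1f_1f(probe, gallery_frame):
    """Baseline: one probe frame against one gallery frame."""
    return rms_distance(probe, gallery_frame)


def match_1f_nf(probe, gallery_frames):
    """One probe frame against N gallery frames (nearest frame assumed)."""
    return min(rms_distance(probe, g) for g in gallery_frames)


def match_1f_3d(probe, gallery_frames):
    """One probe frame against a model aggregated from the N gallery frames."""
    return rms_distance(probe, build_model(gallery_frames))


def match_3d_3d(probe_frames, gallery_frames):
    """Aggregated probe model against aggregated gallery model."""
    return rms_distance(build_model(probe_frames), build_model(gallery_frames))


# Toy data: two gallery frames whose errors around [1, 2, 3] are opposite.
gallery = [[1.5, 1.5, 3.5], [0.5, 2.5, 2.5]]
probe = [1.0, 2.0, 3.0]
print(match_1f_nf(probe, gallery))  # distance to the nearest raw frame
print(match_1f_3d(probe, gallery))  # distance to the averaged model
```

Here the two gallery frames carry opposite errors, so averaging cancels them and the 1F-3D distance (0.0) is smaller than the 1F-NF distance (0.5). Real depth noise will not cancel exactly, and aggregation can also lose detail, which is why the comparison has to be made experimentally. Because `rms_distance` is symmetric in its arguments, swapping the roles of probe and gallery in `match_1f_nf` gives the NF-1F score.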