3D FACE RECONSTRUCTION FROM VIDEO USING A GENERIC MODEL

A. Roy Chowdhury, R. Chellappa
Center for Automation Research and Department of ECE
University of Maryland, College Park, MD 20742
{amitrc, rama}@cfar.umd.edu

S. Krishnamurthy
CACC, Dept. of ECE
North Carolina State University, Raleigh, NC
shkrishn@unity.ncsu.edu

T. Vo
Department of CS
California State University, Fullerton, CA
taivip@yahoo.com

Partially supported by NSF grant #0086075.
The author worked on this problem during his stay at Maryland in Fall 2001.
The author developed a part of the code during his summer internship at Maryland in 2001.

ABSTRACT

Reconstructing a 3D model of a human face from a video sequence is an important problem in computer vision, with applications to recognition, surveillance, multimedia, etc. However, the quality of 3D reconstructions using structure from motion (SfM) algorithms is often not satisfactory. One common method of overcoming this problem is to use a generic model of a face. Existing work using this approach initializes the reconstruction algorithm with this generic model. The problem with this approach is that the algorithm can converge to a solution very close to this initial value, resulting in a reconstruction which resembles the generic model rather than the particular face in the video which needs to be modeled. In this paper, we propose a method of 3D reconstruction of a human face from video in which the 3D reconstruction algorithm and the generic model are handled separately. A 3D estimate is obtained purely from the video sequence using SfM algorithms without use of the generic model. The final 3D model is obtained after combining the SfM estimate and the generic model using an energy function that corrects for the errors in the estimate by comparing local regions in the two models. The optimization is done using a Markov Chain Monte Carlo (MCMC) sampling strategy. The main advantage of our algorithm over others is that it is able to retain the specific features of the face in the video sequence even when these features are different from those of the generic model. The evolution of the 3D model through the various stages of the algorithm is presented.

1. INTRODUCTION

Reconstructing 3D models from video sequences is an important problem in computer vision with applications to recognition, medical imaging, video communications, etc. Though numerous algorithms exist which can reconstruct a 3D scene from two or more images using structure from motion (SfM) [1], the quality of such reconstructions is often poor. The main reason for this is the poor quality of the input images and a lack of robustness in the reconstruction algorithms to deal with it [2]. One particularly interesting application of 3D reconstruction from 2D images is in the area of modeling a human face from video. The successful solution of this problem has immense potential for applications in face recognition, surveillance, multimedia, etc.

A few algorithms exist which attempt to solve this problem using a generic model of a face [3, 4]. Their typical approach is to initialize the reconstruction algorithm with this generic model. The problem with this approach is that the algorithm can converge to a solution very close to this initial value, resulting in a reconstruction which resembles the generic model rather than the particular face in the video which needs to be modeled. This method might give very good results when the generic model has significant similarities with the particular face being reconstructed. However, if the features of the generic model are very different from those being reconstructed, the solution from this approach may be highly erroneous.

We propose an alternative way of reconstructing a 3D model of a face. Our method also incorporates a generic model; however, we do so after obtaining the estimate from the SfM algorithm. The SfM algorithm reconstructs purely from the video data.
This reconstruction is fused with the generic model in an energy function minimization framework [5]. The 3D estimate obtained from the reconstruction algorithm needs to be smoothed in local regions where there are errors. These regions are identified with the help of the generic model. After the 3D depth estimate and the generic model have been aligned, the boundaries where there are sharp depth discontinuities are identified from the generic model. Each vertex of the triangular mesh representing the model is assigned a binary variable (defined as a line process, following the terminology of [6]) depending upon whether or not it is part of a depth boundary. The regions which are inside these boundaries are smoothed. The energy function consists of two terms which determine the closeness of the final smoothed solution to either the generic model or the 3D depth estimate, and a third term which determines whether or not a particular vertex of the mesh should be smoothed based on the value of the variable representing the line process for that vertex. The combinatorial optimization problem is solved using simulated annealing and a Markov Chain Monte Carlo sampling strategy [7, 8, 9]. The advantage of this method is that the particular characteristics of the face that is being modeled are not lost since the SfM algorithm does not incorporate the generic model. Moreover, any errors in the reconstruction are corrected in the energy function minimization process by comparison with the generic model.
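
To make the structure of this fusion step concrete, the following is a minimal sketch and not the energy function or sampler used in this paper. It assumes a 1D chain of vertices in place of a triangular mesh, a line process that is fixed once from the generic model rather than optimized, and a simple quadratic form for all three terms. The names `line_process`, `energy`, and `anneal`, the weights `mu` and `lam`, and the geometric cooling schedule are all hypothetical choices made for illustration.

```python
import numpy as np

def line_process(generic, threshold):
    """Flag vertices next to a sharp depth jump in the generic model (True = boundary)."""
    jumps = np.abs(np.diff(generic)) > threshold
    l = np.zeros(generic.shape[0], dtype=bool)
    l[:-1] |= jumps   # vertex on the left side of a jump
    l[1:] |= jumps    # vertex on the right side of a jump
    return l

def energy(f, sfm, generic, l, mu, lam):
    """Assumed quadratic energy: data term, generic-model term, gated smoothness term."""
    data = np.sum((f - sfm) ** 2)         # closeness to the SfM depth estimate
    prior = np.sum((f - generic) ** 2)    # closeness to the generic model
    d2 = np.diff(f) ** 2
    left = np.zeros_like(f)
    right = np.zeros_like(f)
    left[1:] = d2                         # squared difference to the left neighbour
    right[:-1] = d2                       # squared difference to the right neighbour
    smooth = np.sum((~l) * (left + right))  # only non-boundary vertices are smoothed
    return (1.0 - mu) * data + mu * prior + lam * smooth

def anneal(sfm, generic, l, mu=0.3, lam=1.0, t0=1.0, cooling=0.999,
           steps=20000, step_size=0.05, seed=0):
    """Simulated annealing with single-vertex Metropolis proposals."""
    rng = np.random.default_rng(seed)
    f = sfm.astype(float).copy()          # start from the SfM estimate
    e = energy(f, sfm, generic, l, mu, lam)
    best_f, best_e = f.copy(), e
    t = t0
    for _ in range(steps):
        i = rng.integers(f.shape[0])
        cand = f.copy()
        cand[i] += rng.normal(0.0, step_size)
        e_cand = energy(cand, sfm, generic, l, mu, lam)
        # Metropolis rule: always accept improvements, sometimes accept worse states
        if e_cand <= e or rng.random() < np.exp((e - e_cand) / t):
            f, e = cand, e_cand
            if e < best_e:
                best_f, best_e = f.copy(), e
        t *= cooling                      # geometric cooling (an assumption)
    return best_f, best_e

if __name__ == "__main__":
    generic = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])               # toy depth profile
    sfm = generic + np.array([0.2, -0.1, 0.1, -0.2, 0.15, -0.05])    # noisy estimate
    l = line_process(generic, threshold=0.5)
    fused, e_final = anneal(sfm, generic, l)
    print(fused, e_final)
```

In this sketch the smoothness term is switched off at vertices flagged by the line process, so the depth jump taken from the generic model is preserved while the regions on either side are smoothed. A small `mu` keeps the result close to the SfM estimate, which mirrors the stated aim of retaining the specific features of the face in the video.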