264 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 1, NO. 3, SEPTEMBER 1999

Face Detection Using Quantized Skin Color Regions Merging and Wavelet Packet Analysis

Christophe Garcia and Georgios Tziritas, Member, IEEE

Abstract— Detecting and recognizing human faces automatically in digital images strongly enhances content-based video indexing systems. In this paper, a novel scheme for human face detection in color images under nonconstrained scene conditions, such as the presence of a complex background and uncontrolled illumination, is presented. Color clustering and filtering using approximations of the YCbCr and HSV skin color subspaces are applied to the original image, providing quantized skin color regions. A merging stage is then iteratively performed on the set of homogeneous skin color regions in the color quantized image, in order to provide a set of potential face areas. Constraints related to the shape and size of faces are applied, and face intensity texture is analyzed by performing a wavelet packet decomposition on each face area candidate in order to detect human faces. The wavelet coefficients of the band-filtered images characterize the face texture, and a set of simple statistical deviations is extracted in order to form compact and meaningful feature vectors. Then, an efficient and reliable probabilistic metric derived from the Bhattacharyya distance is used to classify the extracted feature vectors into face or nonface areas, using prototype face area vectors acquired in a previous training stage.

Index Terms— Bhattacharyya distance, color clustering, face detection, wavelet decomposition.

I. INTRODUCTION

DETECTING human faces automatically is becoming a very important task in many applications, such as security access control systems or content-based video indexing and retrieval systems like the Distributed audioVisual Archives Network (DiVAN) system [11].
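The first stage of the pipeline summarized above, skin color filtering in a chrominance subspace, can be sketched as follows. This is a minimal illustration and not the paper's exact subspace approximation: the BT.601 conversion is standard, but the Cb/Cr threshold ranges used here are illustrative values commonly quoted in the skin detection literature, not the thresholds of the method described in the paper.

```python
import numpy as np

def skin_mask_ycbcr(image_rgb, cb_range=(77, 127), cr_range=(133, 173)):
    """Return a boolean skin mask for an RGB image (H x W x 3, uint8).

    The Cb/Cr ranges are illustrative literature values, NOT the
    subspace approximations used in the paper.
    """
    img = image_rgb.astype(np.float64)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    # ITU-R BT.601 RGB -> YCbCr chrominance (full-range approximation)
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # A pixel is kept as skin-colored when both chrominances fall
    # inside their respective ranges
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1]))
```

The resulting binary mask would then feed the region quantization and merging stages; the subsequent shape, size, and wavelet texture tests prune the surviving candidate areas.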
The European Esprit project DiVAN aims at building and evaluating a distributed audiovisual archives network providing a community of users with facilities to store raw video material and access it in a coherent way, on top of high-speed wide-area communication networks. The raw video data are first automatically segmented into shots using techniques based on color histogram dissimilarity and camera motion analysis. From the content-related image segments and keyframes, salient features such as region shape, intensity, color, texture, and motion descriptors are extracted and used for indexing and retrieving information. In order to allow queries at a higher semantic level, some particular pictorial objects may be detected and exploited for indexing. The automatic detection of human faces provides users with powerful indexing capabilities for the video material. In DiVAN, faces are detected in the extracted keyframes and stored in a meta-database. Without performing face recognition, frames containing faces may be searched according to the number, sizes, or positions of the detected faces within the keyframes, looking for specific classes of scenes representing a large audience (multiple faces), an interview (two medium-size faces), or a close-up view of a speaker (a single large face). Face recognition may follow face detection when faces are large enough and in a semi-frontal position, using a method we developed and described in [16].

Manuscript received April 16, 1999; revised July 13, 1999. The associate editor coordinating the review of this paper and approving it for publication was Prof. Alberto Del Bimbo. This work was supported in part by the DiVAN Esprit Project EP 24956. The authors are with the Institute of Computer Science, Foundation for Research and Technology-Hellas, GR 711 10 Heraklion, Crete, Greece (e-mail: cgarcia@csi.forth.gr; tziritas@csi.forth.gr). Publisher Item Identifier S 1520-9210(99)07634-8.
When detected faces are recognized and associated automatically with textual information, as in the Name-it [33] or Piction [6] systems, potential applications become possible, such as a news video viewer providing descriptions of the displayed faces, a news text browser giving facial information, or automated video annotation generators for faces. Although face detection is closely related to face recognition as a required preliminary step, face recognition algorithms have received most of the attention in the academic literature compared to face detection algorithms. Considerable progress has been made on the problem of face recognition, especially under stable conditions such as small variations in lighting, facial expression, and pose. Extensive surveys are presented in [42] and [5]. These methods can be roughly divided into two groups: geometric feature matching and template matching. In the first case, some geometrical measures of distinctive facial features such as the eyes, mouth, nose, and chin are extracted [3], [8]. In the second case, the face image, represented as a two-dimensional (2-D) array of intensity values, is compared to one or several templates representing a whole face. The earliest methods for template matching are correlation-based, and thus computationally very expensive, requiring a great amount of storage. In the last decade, the principal components analysis (PCA) method, also known as the Karhunen–Loève transform, has been successfully applied in order to perform dimensionality reduction [22], [39], [29], [37], [1]. We may cite other methods based on neural network classification [30], [9], algebraic moments [18], isodensity lines [28], or deformable templates [23], [43]. In most of these face recognition approaches, the existence and location of human faces in the processed images are known a priori, so there is little need to detect and locate faces.
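The PCA dimensionality reduction mentioned above can be sketched in a few lines. This is a generic eigenface-style sketch via the singular value decomposition, not the specific formulation of any of the cited works; the function names and the choice of SVD are illustrative assumptions.

```python
import numpy as np

def pca_basis(face_matrix, k):
    """Compute a k-dimensional PCA (Karhunen-Loeve) basis from training faces.

    face_matrix: (n_faces, n_pixels) array, one flattened face image per row.
    Returns (mean, components) where components is a (k, n_pixels) array of
    orthonormal principal axes ("eigenfaces").
    """
    mean = face_matrix.mean(axis=0)
    centered = face_matrix - mean
    # The right singular vectors of the centered data are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def project(face, mean, components):
    """Map a flattened face onto its low-dimensional PCA coordinates."""
    return components @ (face - mean)
```

A new face image can then be compared to stored faces in the k-dimensional coordinate space instead of the full pixel space, which is what makes template matching tractable in these approaches.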
In image and video databases, there is generally no constraint on the number, location, size, and orientation of human faces, and the background is generally complex. Moreover, color information

1520–9210/99$10.00 © 1999 IEEE