COMPUTER VISION AND IMAGE UNDERSTANDING
Vol. 70, No. 1, April, pp. 63–73, 1998
ARTICLE NO. IV970619

A Dynamic and Multiresolution Model of Visual Attention and Its Application to Facial Landmark Detection

Barnabás Takács∗
Virtual Celebrity Productions, 3679 Motor Ave. Suite 200, Los Angeles, California 90034

and

Harry Wechsler†
Department of Computer Science, George Mason University, Fairfax, Virginia 22030-4444

Received April 7, 1995; accepted February 19, 1997

We describe a novel dynamic and multiresolution attention scheme for the generation of visual saccades and its application to locate candidate regions for facial feature recognition. The low-level, data-driven attention model suggested herein employs a nonlinear sampling lattice of oriented Gaussian filters and uses small oscillatory movements to extract local image characteristics (conspicuity). As the sampling grid moves over the image, multiresolution “evidences” of local features are accumulated in a short-term visual memory. We propose a simple integration technique that computes the saliency surface iteratively across saccadic movements. Simulation results on face images demonstrate the applicability of our approach. © 1998 Academic Press

Key Words: attention; face recognition; low-level vision.

1. INTRODUCTION

Understanding the computational principles of visual perception is one of the most challenging scientific problems. Building computer models to capture, at some level of abstraction, the computational aspects of visual perception can provide a conceptual framework for identifying and understanding its basic principles. The mechanisms revealed using these models may also help us to analyze some of the computational processes taking place in the brain.

The flow of visual input reaching the eye consists of huge amounts of time-varying information. It is crucial for both biological entities and automated systems to perceive and comprehend the changes made available by such imagery.
One should locate and analyze only the information relevant to the current visual task in order to support efficient use of computing resources and quickly focus on selected areas of the scene as needed. Attention mechanisms are thus required to balance between computationally expensive parallel techniques and time-intensive serial techniques in order to simplify computation and reduce the amount of further processing [20]. Attention is basically a problem of intelligent control dealing with the allocation of computational resources in terms of where, what, and how to sense and process the data. Beyond reducing complexity [7], efficient attention schemes also form the basis of behavioral coordination [1].

∗ E-mail: takacsb@virtualceleb.com.
† E-mail: wechsler@cs.gmu.edu.

In the context of face recognition, finding regions of interest (such as eye sockets, nose, or mouth) prior to the use of more complex and computationally expensive recognition stages would enhance both speed and recognition accuracy. Face recognition starts with the detection of face patterns, proceeds by normalizing the face images to account for geometrical and illumination changes (typically using information about the facial landmarks), and identifies the faces using appropriate classification algorithms. There are two major approaches for automated identification of human faces. The abstractive (feature point extraction) approach seeks to define a set of key parameters for the measurement of faces and to subsequently employ standard statistical pattern recognition techniques for matching among faces using these measurements. In contrast to the abstractive type, one can use holistic (“template matching”) approaches characteristic of methods such as backpropagation (“holons”), principal component analysis (PCA), and singular value decomposition (SVD) using eigenfaces.
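To make the abstractive approach concrete, the following is a minimal sketch, not the method of this paper. It assumes that four landmarks have already been located, that the inter-ocular distance is used as the normalizing unit, and that matching is done by nearest neighbor in Euclidean distance. The names `landmark_features`, `nearest_face`, and all coordinates are hypothetical and chosen only for illustration.

```python
import math


def landmark_features(landmarks):
    """Turn named (x, y) landmarks into scale-normalized pairwise distances."""
    names = sorted(landmarks)
    # Assumption: the inter-ocular distance is the normalizing unit.
    unit = math.dist(landmarks["left_eye"], landmarks["right_eye"])
    if unit == 0:
        raise ValueError("eye landmarks coincide; cannot normalize")
    feats = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            feats.append(math.dist(landmarks[a], landmarks[b]) / unit)
    return feats


def nearest_face(probe, gallery):
    """Return the gallery label whose feature vector is closest to the probe's."""
    probe_vec = landmark_features(probe)
    return min(
        gallery,
        key=lambda label: math.dist(probe_vec, landmark_features(gallery[label])),
    )


# Hypothetical landmark coordinates (pixels), for illustration only.
alice = {"left_eye": (30, 40), "right_eye": (70, 40),
         "nose": (50, 65), "mouth": (50, 85)}
bob = {"left_eye": (28, 42), "right_eye": (72, 42),
       "nose": (50, 60), "mouth": (50, 95)}
gallery = {"alice": alice, "bob": bob}

# The probe is alice's face scaled by 2 and shifted by (10, 5).
probe = {"left_eye": (70, 85), "right_eye": (150, 85),
         "nose": (110, 135), "mouth": (110, 175)}

print(nearest_face(probe, gallery))  # alice
```

Because every distance is divided by the inter-ocular distance, the scaled and shifted probe yields the same feature vector as its gallery entry. This is one simple illustration of why reliable landmark detection matters for the measurement step.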
Note that both the abstractive and holistic approaches require the detection of facial landmarks for feature measurements and normalization, respectively. Such detection is also characteristic of the attention mechanisms used by the human visual system to screen the visual field and focus on salient inputs.

This paper focuses on the implementation of a data-driven (bottom-up) mechanism that localizes regions of interest in the input. The novel dynamic and multiresolution scheme described herein is mostly concerned with the aspects involved in selecting (information-loaded) fixation points and the early detection of salient facial landmarks as needed for further normalization

1077-3142/98 $25.00 Copyright © 1998 by Academic Press. All rights of reproduction in any form reserved.