Performance Evaluation of the Covariance Descriptor for Target Detection

Pedro Cortez–Cargill*, Cristobal Undurraga–Rius*, Domingo Mery* and Alvaro Soto*
*Computer Science Department, Pontificia Universidad Católica de Chile
Av. Vicuña Mackenna 4860 (143), Santiago, Chile
Email: pmcortez@uc.cl, caundurr@uc.cl, dmery@ing.puc.cl, asoto@ing.puc.cl

Abstract—In computer vision, there has been strong progress in creating new image descriptors. One descriptor that has recently appeared is the Covariance Descriptor, but there have been no studies of the different methodologies for its construction. To address this problem, we have analyzed the contribution of diverse image features to the descriptor, and therefore their contribution to the detection of varied targets, in our case faces and pedestrians. To that end, we have defined a methodology to determine the performance of the covariance matrix created from different characteristics. We are now able to determine the best set of features for face detection and for pedestrian detection, separately for each problem. We have also established that not every combination of features can be used, because there may be no correlation between them. Finally, when the analysis is performed with the best set of features, we reach a performance of 99% for the face detection problem and 85% for the pedestrian detection problem. With this we hope to have built a more solid base for choosing features for this descriptor, making it possible to move forward to other topics such as object recognition or tracking.

Index Terms—Region Covariance, target detection.

I. INTRODUCTION

One of the most extraordinary abilities of human vision is recognizing objects and faces. Regardless of the angle, size, luminosity or occlusion of the object, human vision is able, in almost every case, to recognize the object or person.
This ability is essential in many aspects of our lives; for example, without the capacity to recognize faces or facial expressions we could not have a satisfactory social life. Given this, the next logical step is to design machines or systems that can imitate this ability automatically, and to use them in applications such as surveillance or quality control.

Computer vision is a subfield of artificial intelligence. Its main objective is to program machines that can understand or recognize the patterns of a scene or the characteristics of an image. These tasks remain a remarkable challenge that has not yet been solved.

Thanks to advances in technology and the research conducted in the last few years, many different applications for detection and recognition have been created in varied fields. These include video games, driver assistance, video editing, quality control, traffic control, surveillance, security, tracking, etc. For example, in driver assistance there are applications that warn drivers when they are falling asleep, using facial expression recognition [1]; in quality control there are many applications that determine whether a product is in perfect shape, using image features such as size, shape and color [2]; in surveillance and security there are applications that detect strange objects or behaviors (robberies, violence, trespassing, etc.) in security video [3].

Currently, these tasks are carried out with different techniques that extract relevant information, known as features and descriptors, from images or videos [4]. Feature selection is an important step in the detection and recognition of objects. Ideally, a descriptor must be discriminative, robust, and fast to compute. There is a great variety of descriptors: some focus on being computed quickly, while others focus on capturing as much information as possible.
On the other hand, there are algorithms that detect regions of interest that are invariant to size, luminosity and perspective; this way, features are computed only for the relevant regions instead of the entire image. This technique is known as viewpoint invariant segmentation [5], [6].

In this paper we define a methodology that determines the performance of different covariance matrices built from distinct sets of features. With it we are able to determine which sets are the best for the detection of objects, faces and pedestrians. First, we obtain a set of images in which we select a specific target that we want to detect. Next, we find, in a search image, the region with the smallest distance to the target region initially selected. From these distances we define an acceptance threshold that decides whether the object is in the search image or not. We obtain the performance for each set of features used to create the covariance descriptor. Finally, when the analysis is performed with the best set of features, we get a performance of 99% for the face detection problem and 85% for the pedestrian detection problem.

The rest of the paper is organized as follows. In section 2 we describe the state of the art of the problem; in section 3 we explain the mathematical basics, the hypothesis and the implementation of the problem; in section 4 we present the proposed methodology and the results obtained; and finally