EVALUATING PERFORMANCE OF AUTOMATIC IMAGE ANNOTATION: EXAMPLE CASE BY FUSING GLOBAL IMAGE FEATURES

Ville Viitaniemi and Jorma Laaksonen
Adaptive Informatics Research Centre
Helsinki University of Technology
{ville.viitaniemi,jorma.laaksonen}@tkk.fi

This work has been supported by the Academy of Finland in the projects Neural Methods in Information Retrieval Based on Automatic Content Analysis and Relevance Feedback and Finnish Centre of Excellence in Adaptive Informatics Research. Special thanks to Kobus Barnard for helping with the experimental setup.

ABSTRACT

In this paper we consider two traditional metrics for evaluating the performance of automatic image annotation, the normalised score (NS) and the precision/recall (PR) statistics, particularly in connection with a de facto standard 5000-image Corel benchmark annotation task. We also motivate and describe a third performance measure, de-symmetrised termwise mutual information (DTMI), as a principled compromise between the two traditional extremes. In addition to discussing the measures theoretically, we correlate them experimentally for a family of annotation system configurations derived from the PicSOM image content analysis framework. Looking at the obtained performance figures, we notice that such a system, based on the fusion of numerous global image features, clearly outperforms the methods considered from the literature.

1. INTRODUCTION

In this paper we investigate the problem of automatically annotating images with keywords. A wealth of automatic image annotation methods has been proposed in the literature (e.g. [1, 2, 3]). Often these methods are specifically designed for the auto-annotation application. In this paper we experimentally relate the performance of such methods to that of an annotation system constructed from more generic image analysis tools. In particular, we derive the annotations by using our general-purpose PicSOM image content analysis framework (e.g. [4]). The framework has also been used for tasks such as interactive image and video retrieval, industrial quality monitoring and facial image retrieval. The framework produces image similarity assessments by combining partial similarities defined by elementary image features. In the choice of the elementary image features we take a straightforward approach and use global image features. This is in contrast with many of the annotation models that associate keywords with specific image locations.

In the literature, image annotation performance has been evaluated by a variety of means. In this work we review the two most prominent performance measures for supervised image annotation, the normalised score (NS) and the precision/recall (PR) statistics, discussing the merits and shortcomings of each. As we find these measures to be somewhat unprincipled, we promote an additional performance measure, de-symmetrised termwise mutual information (DTMI), that is based on the information-theoretic concept of mutual information. This measure can be interpreted as a well-grounded compromise between the opposite extremes of NS and PR, both of which reward somewhat undesirable characteristics of the annotations.

In addition to discussing the three annotation performance measures theoretically, we compare them empirically in a de facto standard annotation task of the PR-oriented literature. In this way we obtain an empirical correspondence between the annotation performance levels in terms of all three measures.
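As a reminder of the underlying concept, the mutual information of two discrete random variables X and Y is defined in the standard way as

I(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} .

In a termwise reading sketched here purely for illustration, X and Y could stand for the ground-truth and the predicted occurrence of a single keyword across the test images; the precise termwise, de-symmetrised form used as DTMI is developed later in the paper.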
By comparing the PR results to those reported in the literature, we additionally find that our PicSOM framework with global image features outperforms all the other considered methods, some of which represent the state of the art.

2. THE ANNOTATION TASK AND PERFORMANCE EVALUATION

In this paper, we study the image annotation task as a supervised learning problem. The annotation system is trained with a set of images that are labeled with keywords annotating the content of the images. A single image is usually associated with several keywords. After training, the task of the system is to predict a similar set of annotating keywords for a previously unseen test set of images, based on the visual properties of the images. The goodness of the predicted annotations is assessed by comparing them with a manually specified ground truth. For later use, we denote the number of images and keywords in the test set with N and W, respectively. For brevity, we call the number of keywords annotating
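To make the evaluation setting concrete, the following sketch shows how per-keyword precision and recall, the quantities underlying the PR statistics, could be tabulated from predicted and ground-truth annotations of a test set. It implements only the standard definitions; the function name and the data layout (dictionaries mapping image identifiers to keyword sets) are assumptions of this sketch, not the exact evaluation protocol of the benchmark, which is specified later in the paper.

# Illustrative sketch only: standard per-keyword precision/recall over a test
# set, assuming each image is mapped to a set of keywords. The names and the
# data layout are assumptions, not the benchmark's actual implementation.

def per_keyword_precision_recall(ground_truth, predicted):
    """ground_truth, predicted: dict mapping image_id -> set of keywords."""
    keywords = set()
    for kws in ground_truth.values():
        keywords.update(kws)

    results = {}
    for w in keywords:
        # Images annotated with w in the ground truth / in the predictions.
        relevant = {img for img, kws in ground_truth.items() if w in kws}
        retrieved = {img for img, kws in predicted.items() if w in kws}
        correct = len(relevant & retrieved)
        precision = correct / len(retrieved) if retrieved else 0.0
        recall = correct / len(relevant) if relevant else 0.0
        results[w] = (precision, recall)
    return results

if __name__ == "__main__":
    # Toy example with two test images and three keywords.
    gt = {"img1": {"sky", "water"}, "img2": {"sky", "tree"}}
    pred = {"img1": {"sky"}, "img2": {"sky", "water"}}
    for w, (p, r) in sorted(per_keyword_precision_recall(gt, pred).items()):
        print(f"{w}: precision={p:.2f} recall={r:.2f}")

Mean per-keyword precision and recall are then obtained by averaging these values over the vocabulary, either over all W keywords or over only those keywords actually predicted, a choice that itself affects the reported figures.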