Comparison of Feature Sets using Multimedia Translation

Pınar Duygulu¹, Özge Can Özcanlı², and Norman Papernick¹

¹ School of Computer Science, Carnegie Mellon University, Informedia Project, Pittsburgh, PA, USA {pinar, norm}@cs.cmu.edu
² Dept. of Computer Engineering, Middle East Technical University, Ankara, Turkey {ozge}@ceng.metu.edu.tr

Abstract. Feature selection is very important for many computer vision applications; however, it is hard to find a good measure for comparing feature sets. In this study, feature sets are compared using the translation model of object recognition, which is motivated by the availability of large annotated data sets. Image regions are linked to words using a model inspired by machine translation, and word prediction performance is used to evaluate the feature sets over large numbers of images.

1 Introduction

Thanks to developing technology, there are many available sources where images and text occur together: there is a huge amount of data on the web, where images occur with surrounding text; with OCR technology it is possible to extract text from images; and above all, almost all images have captions which can be used as annotations. There are also several large image collections (e.g., the Corel data set, most museum image collections, the web archive) in which each image is manually annotated with some descriptive text. Using text and images together helps disambiguation in image analysis and also makes several interesting applications possible, including better clustering, search, auto-annotation, and auto-illustration [4, 5, 9]. In annotated image collections, although it is known that the annotation words are associated with the image, the correspondence between the words and the image regions is unknown. Several methods have been proposed to solve this correspondence problem [4, 12, 14].
We consider the problem of finding such correspondences as the translation of image regions to words, similar to the translation of text from one language to another [9, 10]. As in many vision problems, feature selection plays an important role in translating image regions to words. In this study, we investigate the effect of feature sets on the performance of linking image regions to words. Two different feature sets are compared: one is a set of descriptors chosen from the MPEG-7 feature extraction schemes, since they are widely used for content-based image retrieval tasks; the other is obtained by combining descriptive and helpful features chosen heuristically for their suitability to the task.
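To make the machine-translation analogy concrete, the following is a minimal sketch of an IBM Model 1-style EM procedure that estimates word-given-region translation probabilities from images whose regions and annotation words co-occur without known alignments. It assumes image regions have already been vector-quantized into discrete "blob" tokens; the function name, data layout, and token names are illustrative assumptions, not the exact model or implementation used in the paper.

```python
from collections import defaultdict

def train_translation_probs(pairs, n_iter=20):
    """EM in the style of IBM Model 1: estimate t(word, blob),
    the probability that a blob token translates to a word.
    `pairs` is a list of (blob_tokens, annotation_words) per image,
    with no alignment between the two lists."""
    vocab_w = {w for _, words in pairs for w in words}
    # uniform initialization over the word vocabulary
    t = defaultdict(lambda: 1.0 / len(vocab_w))
    for _ in range(n_iter):
        count = defaultdict(float)  # expected (word, blob) co-alignments
        total = defaultdict(float)  # expected alignments per blob
        # E-step: distribute each word over the blobs in its image
        for blobs, words in pairs:
            for w in words:
                z = sum(t[(w, b)] for b in blobs)
                for b in blobs:
                    c = t[(w, b)] / z
                    count[(w, b)] += c
                    total[b] += c
        # M-step: renormalize expected counts into probabilities
        for (w, b), c in count.items():
            t[(w, b)] = c / total[b]
    return t
```

With a toy corpus such as `[(["b1", "b2"], ["sky", "grass"]), (["b1"], ["sky"])]`, the second image disambiguates the first: EM drives `t[("sky", "b1")]` above `t[("grass", "b1")]`, illustrating how unaligned region-word co-occurrence alone can recover correspondences.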