Proceedings ELPUB2006 Conference on Electronic Publishing – Bansko, Bulgaria – June 2006

Semi-Automatic Extraction of Thesauri and Semantic Search in a Digital Image Archive

José C. González 1,2, Julio Villena 1,3, Cristina Moreno 1, José L. Martínez-Fernández 1,3

1 DAEDALUS-Data, Decisions and Language, S.A. Centro de Empresas La Arboleda, Ctra. N-III, Km. 7,300 E-28031 Madrid, Spain
2 ETSI Telecomunicación, Universidad Politécnica de Madrid, Spain
3 Universidad Carlos III de Madrid, Spain
email: {jvillena,jgonzalez,cmoreno,jmartinez}@daedalus.es

Abstract

This paper addresses three topics. First, techniques for the semi-automatic normalization of image descriptors in a digital image collection, starting from free-text titles and keywords. Second, the efficient construction of thesauri for specific image collections. Third, the optimisation of search mechanisms to deal with the special characteristics of image collections and with their use through web-based search interfaces. The solutions presented here were developed in the framework of a commercial project intended to improve image search on a website that sells photographs over the web. The ultimate goal of the project is to improve customer access to a collection of more than two million photographs. The project was developed by the company DAEDALUS, S.A. for the website www.stockphotos.es, owned by the company Stock Photos S.L.

Keywords: digital image library; information retrieval; thesaurus; subject hierarchy; normalisation process; translation; automatic classification

1 Introduction

The problem addressed in this paper is how to improve the chances that a customer finds the photograph he/she needs to illustrate a publication or an advertising campaign, in an archive holding several million images, and retrieves it in the shortest possible time. This objective requires that the images be tagged in the way that best matches user queries.
Tagging a photograph involves considerable work: it means specifying the objects shown, the environment in which they are located, the relationships among them, the actions or effects taking place at that moment, the feelings evoked, the light, the colour range, the photographic techniques used, etc. Tagging a full image collection therefore requires a huge human effort. Most of the time, tagging is done with the sole aid of basic word processors, with poor guidelines or none at all, with no spell checking and with limited quality controls, usually carried out on a random sample. The whole process is clearly error-prone.

These problems have a greater impact in the case of (western) languages with diacritics, rich inflexion, or both (as is the case for most Romance languages). The reason is that exact matching of queries against index terms becomes more difficult, even when the usual diacritic/typography elimination and stemming algorithms are applied.

Another source of noise is language translation. Photo collections are distributed and resold in international markets, which means that, for the sake of accessibility, all tags have to be translated from the source tagging language into the language of the target market.

Commercial organizations are now concerned about the real chances of their clients reaching the content they are looking for. In this project, the approach to the problem was threefold:

1. Normalization of image descriptors from free-text titles and keywords.
2. Construction of thesauri.
3. Optimisation of indexing and search mechanisms.

Obviously, these steps had to be tackled with the highest possible degree of automation. The tagging effort made in the past (perhaps with criteria, depth, precision or quality that were less than optimal) cannot be repeated. A greater investment, in terms of a fixed amount per image, is not possible for economic reasons. However, a