Crawling, Indexing, and Similarity Searching Images on the Web
(Extended Abstract)

M. Batko^1, F. Falchi^2, C. Lucchese^2, D. Novak^1, R. Perego^2, F. Rabitti^2, J. Sedmidubsky^1, and P. Zezula^1

^1 Masaryk University, Brno, Czech Republic
^2 ISTI-CNR, Pisa, Italy

Abstract. In this paper, we report on our experience in building an experimental similarity search system on a test collection of more than 50 million images, to show that Content-based Image Retrieval (CBIR) systems can scale towards the size of the Web. First, we had to tackle the non-trivial process of image crawling and descriptive feature extraction, performed using the European EGEE computer GRID. This produced a test collection, the first of such scale, that will be opened to the research community for experiments and comparisons. Then, we had to develop indexing and searching mechanisms that can scale up to these volumes and answer similarity queries in real time. The results of our experiments are very encouraging for future applications.

1 Introduction

With the widespread use of digital cameras, more than 80 billion photographs are taken each year, and a large part of them will be put on the web. The management of digital images promises to emerge as a major issue in the coming years. In this context, interest in Content-Based Image Retrieval (CBIR) systems is growing rapidly, since they are a possible answer to the management of such data. A recent survey [1] reports on 56 systems, most of them exemplified by prototype implementations where the typical database size is counted in thousands of images. However, to remain relevant to the new challenges posed by the vast number of images on the web, CBIR systems should scale up their target, shifting from small-scale, often highly specific, datasets to a much larger scale.
The scalability challenge is the focus of the European project SAPIR (Search on Audio-visual content using Peer-to-peer Information Retrieval)^3, which aims at finding new ways to analyze, index, and retrieve the tremendous amounts of speech, image, video, and music that are filling our digital universe, going beyond what the most popular search engines are still doing, that is, searching using text tags that have been associated with multimedia files.

^3 SAPIR European Project, IST FP6: http://www.sapir.eu/