Web image context extraction based on semantic representation of web page visual segments

Georgina Tryfou, Zenonas Theodosiou and Nicolas Tsapatsoulis
Department of Communication and Internet Studies
Technical University of Cyprus
Limassol, Cyprus
Email: {georgia.tryfou, zenonas.theodosiou, nicolas.tsapatsoulis}@cut.ac.cy

Abstract—Among the most challenging research topics of recent years, web image information mining has received special attention. Web images exist in huge numbers, and several methods for their efficient description and representation have been proposed so far. In many of these algorithms, web image information is extracted from textual sources such as image file names, anchor texts, existing keywords and, of course, surrounding text. However, systems that mine information for images from surrounding text suffer from several problems, most notably the inability to assign all relevant text to an image while discarding the irrelevant text. A novel method for indexing web images is discussed in the present paper. The proposed system uses visual cues to obtain a web page segmentation. The segments are represented with semantic metrics, and k-means clustering assigns these segments to the web image they refer to. The evaluation procedure indicates that the semantic representation of the visual segments delivers a good description of the web images.

Keywords-web image context extraction, visual segmentation, semantic representation, vocabulary reduction, WordNet

I. INTRODUCTION

The rapid growth of the World Wide Web, the development of cheap digital recording and storage devices and the extended use of social networks have enabled the production of a huge number of digital image collections.
While more and more images, covering every conceivable topic, are uploaded to the web every minute, billions of users expect web search engines to offer instant and intuitive image search with minimal interaction needed to refine the results. The literature reveals two main approaches to the representation of web images: text-based methods and content-based methods.

In the content-based approaches, image features such as color, shape or texture are used for indexing and searching web images. The user provides a target image and the system retrieves the best ranked images based on their similarity to the user’s query. The use of content-based approaches for web image indexing and retrieval has several disadvantages. Firstly, the extraction of visual features is a time-consuming procedure that depends strongly on the domain of the query image. Moreover, a user does not always have at hand an image similar to the one he or she is searching for. Finally, although it has been a long time since scientists working on this approach defined the semantic gap [1], i.e. the inability of a system to interpret images based on automatically extracted features, a solution still does not exist.

The text-based approaches, on the other hand, use the associated text as a source for deriving the content of images. Image file names, anchor texts, surrounding paragraphs or even the whole text of the hosting web page are examples of textual content that may be used in such systems. The user provides keywords or key phrases, and text retrieval techniques are used to retrieve the best ranked images.

One of the key issues in text-based methods for web image indexing is how the text blocks are extracted from a web page in order to be used as concept sources for images. There are several approaches that attempt to address this.
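To make the text block extraction problem concrete, the following minimal Python sketch implements the simplest strategy one could adopt: taking a fixed number of terms on each side of an image tag. It is an illustration only and not the method of any cited system; the function name `window_around_images`, the `size` parameter and the regex-based tokenisation are assumptions made for this example.

```python
import re

# Splits raw HTML into tags ("<...>") and the text runs between them.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")
# Captures the src attribute of an <img> tag (assumed to be quoted).
IMG_SRC = re.compile(r"<img\b[^>]*\bsrc=[\"']([^\"']+)[\"']", re.I)


def window_around_images(html, size=5):
    """Map each image src to up to `size` terms before and after its tag.

    Hypothetical helper: text inside <script> or <style> is not filtered,
    and a real system would use a proper HTML parser.
    """
    tokens = []  # flat sequence of ("word", term) and ("img", src) entries
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            match = IMG_SRC.match(piece)
            if match:
                tokens.append(("img", match.group(1)))
        else:
            tokens.extend(("word", w) for w in re.findall(r"\w+", piece))

    context = {}
    for i, (kind, value) in enumerate(tokens):
        if kind != "img":
            continue
        before = [v for k, v in tokens[:i] if k == "word"][-size:]
        after = [v for k, v in tokens[i + 1:] if k == "word"][:size]
        context[value] = before + after
    return context


page = (
    "<p>A red fox jumps over the fence</p>"
    '<img src="fox.jpg">'
    "<p>Stock prices fell sharply today in London</p>"
)
print(window_around_images(page, size=3))
```

In this toy page the window keeps "over the fence" but drops "red fox", and it pulls in "Stock prices fell" from an unrelated paragraph, which is exactly the kind of misassignment that motivates more adaptive extraction strategies.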
In a first approach, similar to the one described in [2], a fixed-size sequence of terms is used. Although this method is time efficient, it yields poor results, since the extracted text may be irrelevant to the image or, conversely, important parts of the relevant text may be discarded.

Systems that follow a second approach, such as [3] and [4], make use of the DOM tree structure of the hosting web page. In general, these methods are not adaptive and are designed for specific design patterns.

Web page segmentation is a third approach to text extraction from web pages. This method is employed in [5], where the authors use Vision-based Page Segmentation (VIPS) [6] to extract blocks that contain both image and text, and construct an image graph using the link structure. Web page segmentation is indeed a more suitable solution to the problem of text extraction, since it adapts to different web page styles and depends on the visual cues that form each web page. Most of the algorithms proposed under this approach, however, are not designed specifically for the problem of image indexing and therefore often deliver poor results [7].

In the proposed system, the whole text that is found in the web page is used as a source to extract content