“SURFING FOR KNOWLEDGE”: FINDING SEMANTICALLY SIMILAR WEB CLUSTERS

David Cleary
Applied Research Labs, Ericsson Ireland
Ericsson Software Campus, Athlone, Ireland
David.Cleary@ericsson.com

Diarmuid O'Donoghue
Department of Computer Science, NUI Maynooth
Maynooth, Co. Kildare, Ireland
diarmuid.odonoghue@may.ie

ABSTRACT

In this paper we present a technique for finding semantically similar clusters within web documents retrieved from the Google search engine in response to a set of queries. The technique utilizes a clustering algorithm based on the Latent Semantic Analysis (LSA) work pioneered by Deerwester. We demonstrate how our clustering algorithm can resolve ambiguities prevalent in natural language, such as polysemy and synonymy. Following a detailed description of the algorithm, we present our initial findings using real-world Internet queries. We conclude by evaluating the merits of our clustering algorithm through comparison with human categorization.

KEYWORDS

Information Retrieval, Semantic Web, Latent Semantic Analysis

1. INTRODUCTION

When retrieving information from a search engine, the ability to identify documents related to the meaning of a query is of utmost importance. Typically, the identified documents relate to several different interpretations of the supplied query terms, with documents related to each interpretation randomly scattered across the returned results. One possible approach is to apply automatic knowledge filters to the information retrieval process.

The most challenging issues in web search centre on natural language ambiguity. Both web pages and search queries are expressed in natural language, and thus suffer from ambiguity. Synonymy (multiple lexemes with the same meaning, e.g. path and pavement) and polysemy (one lexeme with multiple meanings, e.g.
cook could refer to the explorer or to food preparation) are ambiguities inherent in most hypertext documents, and attempts to counteract such problems have proved difficult.

The Semantic Web (Berners-Lee et al, 2001) initiative has adopted the ontological approach (Gruber, 1993; Deborah et al, 2003), in which documents are associated with a vocabulary and a context in which that vocabulary is valid. This approach requires discipline when publishing content and is effective in machine-to-machine interactions. However, due to its artificial nature, it has limited use for the informal documents that characterize most of the WWW. The reality of the Internet is that documents and terms will often be spread over several topics; thus, some topics will not be as sharply defined as others, limiting the applicability of an ontological approach.

In this paper we attempt to find clusters of pages that have related semantic meaning. These clusters rely on extracting topics contained within a sub-set of the web using Latent Semantic Analysis (LSA) techniques pioneered by Deerwester (Deerwester et al, 1990). This statistical model of word usage allows processing of information into structures that take advantage of the implicit higher-order associations of words within a document corpus. This technique uses implicit semantic information to discover information clusters, where

IADIS WWW/Internet Conference, pp. 1129-1134, Madrid, Spain, 6-9 October 2004.
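The LSA pipeline the introduction alludes to can be sketched in a few steps: build a term-document matrix, reduce it with a truncated singular value decomposition, and compare documents by cosine similarity in the reduced space. The following is a minimal illustration, not the paper's implementation; the document snippets, the simple whitespace tokenization, and the choice of rank k = 2 are all assumptions made for the example.

```python
import numpy as np

# Toy corpus illustrating the polysemy of "cook" (explorer vs. food
# preparation); these snippets are invented for demonstration only.
docs = [
    "captain cook sailed the pacific ocean",
    "cook the meal in the kitchen oven",
    "the explorer sailed across the ocean",
    "kitchen recipes to cook a meal",
]

# 1. Build a term-document matrix A (rows: terms, columns: documents).
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1.0

# 2. Truncated SVD: keeping only the k largest singular values projects
#    documents into a low-rank "semantic" space in which documents using
#    related vocabulary move closer together.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional row per document

# 3. Cosine similarity between documents in the reduced space; a clustering
#    step would then group documents whose similarity exceeds a threshold.
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = [[cos(doc_vecs[i], doc_vecs[j]) for j in range(len(docs))]
       for i in range(len(docs))]
```

In this sketch the similarity matrix `sim` is what a clustering algorithm would consume; the higher-order associations the paper mentions arise because the SVD relates terms that never co-occur directly but appear in similar contexts.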