Centroid-based Classification Enhanced with Wikipedia

Abdullah Bawakid and Mourad Oussalah
School of Engineering, Department of Electronic, Electrical and Computer Engineering
University of Birmingham
{axb517, M.oussalah}@bham.ac.uk

Abstract— Most traditional text classification methods employ Bag of Words (BOW) approaches that rely on the word frequencies found in the training corpus and the testing documents. Recently, studies have examined using external knowledge to enrich the text representation of documents. Some have focused on WordNet, which suffers from several limitations, including the limited number of words and synsets it contains and its restricted coverage. Other studies have used different aspects of Wikipedia instead. Depending on the features being selected and evaluated and the external knowledge being used, a balance between recall, precision, noise reduction and information loss has to be struck. In this paper, we propose a new centroid-based classification approach that relies on Wikipedia to enrich the representation of documents through the use of Wikipedia's concepts, category structure, links, and article text. We extract candidate concepts for each class with the help of Wikipedia and merge them with important features derived directly from the text documents. Different variations of the system were evaluated, and the results show improvements in the performance of the system.

Keywords— Classification, Semantics, Wikipedia, Categorization, text enrichment

I. INTRODUCTION

The amount of newly created information in electronic form increases at a rapid pace every day. In particular, the volume of text available on the web has created the need for methodologies that organize information into useful forms. Among the various approaches designed for managing this information is Automatic Text Classification (ATC), the process of assigning previously defined classes to documents.
ATC has recently been used in many web applications, including search engine queries [1] and web document classification [2]. These applications usually require fast training and classification in addition to high precision and recall. Many traditional text classification systems focus on Bag of Words (BOW) techniques, which represent documents or classes with weighted features extracted from document terms and their frequencies. Among the most popular BOW techniques are SVM [3], Neural Networks [4], kNN [5], Naïve Bayes [6] and centroid-based methods [7],[8]. Centroid-based methods have generally been found in the literature to be faster and more efficient than most other methods; however, their precision and recall were found lacking when compared with methods such as SVM [8]. In a recently published study, a new centroid-based method was proposed and, after evaluation, found to give better accuracy than many other state-of-the-art BOW approaches [7]. While the efficiency and performance of many BOW methods may be adequate for tasks where the category of a group of documents can be identified by a few distinct keywords appearing in the member documents, this is not always the case. For example, consider a class labeled "Abnormal Psychology" whose training documents never contain the keyword "Hyperthymesia" (a condition of superior memory). If a document discussing Hyperthymesia were to be classified, a BOW method would not be able to recognize the relationship between the class "Abnormal Psychology" and the word "Hyperthymesia". On the other hand, a human with good background knowledge of Hyperthymesia should be able to tell which class the document actually belongs to. Also, consider the case where two consecutive words such as "Cat Fish" convey a meaning different from that of the two separate words. With traditional BOW methods alone, multi-word concepts are usually misinterpreted or simply omitted.
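To make the BOW limitation above concrete, the following is a minimal, purely illustrative sketch (not the implementation evaluated in this paper): a document is reduced to per-term frequencies over a fixed vocabulary, so a multi-word concept like "cat fish" dissolves into two unrelated tokens, and a term absent from the training vocabulary contributes nothing.

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Represent a document as raw term frequencies over a fixed vocabulary."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Hypothetical 4-term vocabulary for illustration.
vocab = ["cat", "fish", "memory", "psychology"]
doc = "the cat fish is a fish"
print(bow_vector(doc, vocab))  # [1, 2, 0, 0] — "cat fish" is split into unrelated tokens
```

In practice, BOW systems weight these counts (e.g. with tf-idf) rather than using raw frequencies, but the loss of multi-word meaning is the same.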
Hence, the use of external knowledge to enrich classification methods should help address scenarios like these, where a semantic understanding of the content of the documents and of the relationship between that content and the different classes is needed. This semantic analysis is especially important when the training or testing documents are short, providing too little information for training with BOW methods.

In this paper, we describe a novel system that employs Wikipedia as its underlying knowledge base in a unique way. The large number of concepts and the diverse domains covered in Wikipedia make it well suited for the task. Instead of mapping the document text to a single concept or a small group of concepts, as done in most previous work, we map it to all of the previously-processed Wikipedia concepts. This is achieved by first processing all Wikipedia articles and extracting the relationship between each of their terms and all the concepts existing within Wikipedia. In essence, this forms a term-concepts table. Then, we extract the category structure within Wikipedia and analyze its links. Furthermore, we employ a centroid-based method directly on the document contents and assign the terms weights based on inter-class, inner-document and inter-document features. The results of all the mentioned steps (Concepts, Categories and Text) are then combined to form prototype vectors for each class during the training stage. The classification stage uses the formed vectors to decide, efficiently, which class each Test Document (TD) belongs to. Our experimental results on the 20-newsgroups dataset and the ODP collection

2010 Ninth International Conference on Machine Learning and Applications, 978-0-7695-4300-0/10 $26.00 © 2010 IEEE, DOI 10.1109/ICMLA.2010.17
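The core centroid-based train/classify flow described above can be sketched as follows. This is a simplified, hypothetical illustration with toy vectors and class names: it builds one prototype per class as the mean of that class's document vectors and assigns a test document to the class whose prototype is most similar under cosine similarity. The paper's actual system additionally merges Wikipedia-derived concept and category features into these prototypes.

```python
import math

def centroid(vectors):
    """Prototype vector: the component-wise mean of a class's document vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(doc_vec, prototypes):
    """Assign the test document to the class with the most similar prototype."""
    return max(prototypes, key=lambda c: cosine(doc_vec, prototypes[c]))

# Toy training vectors over a hypothetical 3-term feature space.
train = {"sports": [[2, 0, 1], [3, 1, 0]], "science": [[0, 4, 1], [1, 3, 2]]}
prototypes = {c: centroid(vs) for c, vs in train.items()}
print(classify([0, 2, 1], prototypes))  # science
```

The appeal of this family of methods, noted above, is efficiency: training is a single pass to average vectors, and classifying a document costs one similarity computation per class.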