Centroid-based Classification Enhanced with Wikipedia
Abdullah Bawakid and Mourad Oussalah
School of Engineering
Department of Electronic, Electrical and Computer Engineering
University of Birmingham
{ axb517 , M.oussalah }@bham.ac.uk
Abstract— Most of the traditional text classification methods
employ Bag of Words (BOW) approaches relying on the word
frequencies within the training corpus and the testing
documents. Recently, studies have examined using external
knowledge to enrich the text representation of documents.
Some have focused on using WordNet which suffers from
different limitations including the available number of words,
synsets and coverage. Other studies used different aspects of
Wikipedia instead. Depending on the features selected and
evaluated and the external knowledge used, a balance
between recall, precision, noise reduction and
information loss has to be struck. In this paper, we propose a
new Centroid-based classification approach relying on
Wikipedia to enrich the representation of documents through
the use of Wikipedia’s concepts, category structure, links, and
articles text. We extract candidate concepts for each class with
the help of Wikipedia and merge them with important features
derived directly from the text documents. Different variations
of the system were evaluated and the results show
improvements in the performance of the system.
Keywords: Classification; Semantics; Wikipedia; Categorization; text enrichment
I. INTRODUCTION
The amount of newly created information in electronic
form increases at a rapid pace every day. In particular, the
volume of text available on the web has created the need for
methodologies that organize information into
useful forms. Among the various approaches that
have been designed for managing the information is
Automatic Text Classification (ATC) which is the process of
assigning previously defined classes to documents. ATC has
been used recently in many web applications including
search engines queries [1] and web documents classification
[2]. These applications usually require fast training and
classification in addition to high precision and recall.
Many traditional text classification systems focus on Bag
of Words (BOW) techniques which represent the documents
or classes with weighted features extracted from documents
terms and their frequencies. Among the most popular BOW
techniques are SVM [3], Neural Networks [4], kNN [5],
Naïve Bayes [6] and centroid-based methods [7],[8].
Centroid-based methods are generally reported in the
literature to be faster and more efficient than most other
methods. Their precision and recall, however, were found lacking
when compared with methods such as SVM [8]. In a
recently published study, a new centroid-based method was
proposed and found after evaluation to give better accuracy
than many other state-of-the-art BOW approaches [7].
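As a rough illustration of the centroid-based family of methods cited above, the sketch below builds one prototype vector per class from term frequencies and assigns a document to the class whose centroid is most similar. The class names, tokens and the use of plain term frequency with cosine similarity are illustrative assumptions; published centroid methods typically add TF-IDF weighting and normalization.

```python
import math
from collections import Counter

def tf_vector(tokens):
    """Normalized term-frequency vector for a tokenized document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def centroid(docs):
    """Prototype vector: the mean of a class's document vectors."""
    proto = Counter()
    for doc in docs:
        for term, w in tf_vector(doc).items():
            proto[term] += w / len(docs)
    return dict(proto)

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(tokens, prototypes):
    """Assign the class whose centroid is most similar to the document."""
    vec = tf_vector(tokens)
    return max(prototypes, key=lambda c: cosine(vec, prototypes[c]))

# Toy training corpus: class label -> tokenized documents (illustrative).
train = {
    "sport": [["match", "goal", "team"], ["team", "coach", "goal"]],
    "tech": [["cpu", "memory", "cache"], ["memory", "disk", "cpu"]],
}
prototypes = {label: centroid(docs) for label, docs in train.items()}
print(classify(["goal", "team", "win"], prototypes))  # -> sport
```

The appeal noted in the text follows from this structure: training touches each document once to form the centroids, and classification costs one similarity computation per class rather than per training document.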
While many BOW methods may perform well on tasks
where the category of a group of documents can be identified
from a few distinct keywords appearing in those documents,
this is not always the case. For example, consider the case
where a class
labeled “Abnormal Psychology” has several training
documents in which none has the keyword “Hyperthymesia”
(meaning superior memory). If a document discussing
Hyperthymesia were to be classified, a BOW method would
not be able to distinguish the relationship between the class
“Abnormal Psychology” and the word “Hyperthymesia”. On
the other hand, a human with good background knowledge in
Hyperthymesia should be able to tell which class the
document actually belongs to. Also, consider the case where
two consecutive words such as “Cat Fish” carry a
meaning different from that of the two words taken separately. With only
traditional BOW methods, multi-word concepts are usually
misinterpreted or simply omitted. Hence, the use of external
knowledge to enrich classification methods should help
address similar scenarios where semantic understanding of
the content of the documents and the relationship between its
contents and the different classes is needed. This semantic
analysis is especially important when the training or testing
documents are short, providing too little information
for BOW methods to train on effectively.
In this paper, we describe a novel system that employs
Wikipedia as its underlying knowledge base in a unique way.
The large number of concepts and diverse domains covered
in Wikipedia makes it most suitable for the task. Instead of
mapping the document text to a concept or a small group of
concepts as done in most of the previous work, we map it to
all of the previously-processed Wikipedia concepts. This is
achieved by first processing all Wikipedia articles and
extracting the relationship between each of their terms and all
the concepts existing within Wikipedia. In essence, this
forms a term-concepts table. Then, we extract the categories
structure within Wikipedia and analyze its links.
Furthermore, we employ a centroid-based method directly on
the documents contents and give the terms weights based on
inter-class, inner-document and inter-document features. The
result of all the mentioned steps (Concepts, Categories and
Text) is then combined to form prototype vectors for each
class during the training stage. The classification uses the
formed vectors to decide which class each Test Document
(TD) belongs to in an efficient way. Our experimental results
on the 20-newsgroups dataset and the ODP collection
2010 Ninth International Conference on Machine Learning and Applications
978-0-7695-4300-0/10 $26.00 © 2010 IEEE
DOI 10.1109/ICMLA.2010.17
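The Wikipedia-based enrichment step described above (a term-concepts table merged with features drawn directly from the document text) can be sketched roughly as follows. The table entries, concept names, weights and the alpha balancing parameter are illustrative assumptions, not values from the paper.

```python
from collections import Counter

# Hypothetical term -> concept association table, standing in for the
# relationships extracted by processing Wikipedia articles (the concept
# names and weights here are illustrative, not real extracted values).
term_concepts = {
    "hyperthymesia": {"Abnormal psychology": 0.9, "Memory": 0.7},
    "memory":        {"Memory": 0.8, "Computer hardware": 0.3},
}

def enrich(tokens, table, alpha=0.5):
    """Merge a document's raw term counts with the Wikipedia concepts
    associated with its terms; alpha balances text vs. concept features."""
    counts = Counter(tokens)
    enriched = {t: float(c) for t, c in counts.items()}
    for term, count in counts.items():
        for concept, weight in table.get(term, {}).items():
            key = "CONCEPT:" + concept
            enriched[key] = enriched.get(key, 0.0) + alpha * weight * count
    return enriched

vec = enrich(["hyperthymesia", "memory"], term_concepts)
print(sorted(vec))
```

Prefixing concept features keeps them distinct from surface terms, so enriched vectors of this shape could feed the same prototype-vector machinery as plain BOW vectors; this is how a document mentioning only “Hyperthymesia” can still share features with an “Abnormal Psychology” class centroid.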