A. Gelbukh (Ed.): CICLing 2006, LNCS 3878, pp. 563 – 566, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Improving kNN Text Categorization
by Removing Outliers from Training Set*
Kwangcheol Shin, Ajith Abraham, and Sang Yong Han
**
School of Computer Science and Engineering, Chung-Ang University,
221, Heukseok-dong, Dongjak-gu, Seoul 156-756, Korea
kcshin@archi.cse.cau.ac.kr,
ajith.abraham@ieee.org, hansy@cau.ac.kr
Abstract. We show that excluding outliers from the training data significantly
improves kNN classifier, which in this case performs about 10% better than the
best know method—Centroid-based classifier. Outliers are the elements whose
similarity to the centroid of the corresponding category is below a threshold.
1 Introduction
Since late 1990s, the explosive growth of Internet resulted in a huge quantity of
documents available on-line. Technologies for efficient management of these docu-
ments are being developed continually. One of representative tasks for efficient
document management is text categorization, called also classification: given a set of
training examples assigned each one to some categories, to assign new documents to a
suitable category.
A well-known text categorization method is kNN [1]; other popular methods are Na-
ive Bayesian [3], C4.5 [4], and SVM [5]. Han and Karypis [2] proposed the Centroid-
based classifier and showed that it gives better results than other known methods.
In this paper we show that removing outliers from the training categories signifi-
cantly improves the classification results obtained with kNN method. Our experiments
show that the new method gives better results than the Centroid-based classifier.
2 Related Work
Document representation. In both categorization techniques considered below,
documents are represented as keyword vectors according to the standard vector space
model with tf-idf term weighting [6, 7]. Namely, let the document collection contains
in total N different keywords. A document d is represented as an N-dimensional vec-
tor of term weight t with coordinates
*
Work supported by the MIC (Ministry of Information and Communication), Korea, under the
Chung-Ang University HNRC-ITRC (Home Network Research Center) support program su-
pervised by the IITA (Institute of Information Technology Assessment).
**
Corresponding author.