Fast k-NN Classifier for Documents Based on a Graph Structure

Fernando José Artigas-Fuentes¹, Reynaldo Gil-García¹, José Manuel Badía-Contelles², and Aurora Pons-Porrata¹

¹ Center of Pattern Recognition and Data Mining, Universidad de Oriente, Santiago de Cuba, Cuba
{artigas,gil,aurora}@csd.uo.edu.cu
² Computer Science and Engineering Department, Universitat Jaume I, Castelló, Spain
badia@uji.es

Abstract. In this paper, a fast k nearest neighbors (k-NN) classifier for documents is presented. Documents are usually represented in a high-dimensional feature space, where their terms are treated as features and the weight of each term reflects its importance in the document. There are many approaches to finding the vicinity of an object, but their performance decreases drastically as the number of dimensions grows. This problem prevents their application to documents. The proposed method is based on a graph index structure with a fast search algorithm. Its high selectivity allows it to achieve a classification quality similar to that of the exhaustive classifier while computing only a small number of distances. Our experimental results show that our method can be applied to problems of very high dimensionality, such as text mining.

Keywords: nearest neighbor classifier, fast nearest neighbor search, text documents.

1 Introduction

Text classification is the task of assigning documents to one or more predefined classes. This task relies on the availability of an initial set of text documents classified under these classes (known as training data). Classification falls at the crossroads of information retrieval, pattern recognition and data mining, fields that involve very large data sets. Moreover, the dimensionality of text documents is usually large. Therefore, it is crucial to design algorithms that scale well with the dimension.
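To make the baseline concrete, the exhaustive k-NN classifier mentioned in the abstract can be sketched as follows. This is only an illustrative sketch, not the method proposed in this paper: documents are assumed to be sparse term-weight dictionaries, cosine similarity is assumed as the closeness measure, and the names `cosine_similarity`, `knn_classify`, and the toy training set are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    # Iterate over the shorter vector; terms missing from the other weigh 0.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k=3):
    """Exhaustive k-NN: the query is compared with every training document."""
    scored = [(cosine_similarity(query, doc), label) for doc, label in training]
    # Most similar documents first (assumed measure: cosine similarity).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Majority vote among the k most similar training documents.
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (term-weight vector, class label) pairs.
training = [
    ({"graph": 1.0, "index": 0.5}, "search"),
    ({"graph": 0.8, "neighbor": 0.6}, "search"),
    ({"goal": 1.0, "match": 0.7}, "sports"),
    ({"team": 0.9, "match": 0.4}, "sports"),
]
query = {"graph": 0.9, "neighbor": 0.2}
predicted = knn_classify(query, training, k=3)
```

Every call to `knn_classify` evaluates the similarity against all training documents, so the cost of one classification grows linearly with the size of the training set. This is the exhaustive computation that index structures aim to avoid.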
I. Bloch and R.M. Cesar, Jr. (Eds.): CIARP 2010, LNCS 6419, pp. 228–235, 2010. © Springer-Verlag Berlin Heidelberg 2010

The k nearest neighbor (k-NN) classifier is a very simple and popular approach used in classification [1], but it requires the exhaustive computation of distances to all training objects. Several methods have been proposed to avoid this problem. One approach involves improving the access methods by combining appropriate index structures, such as trees or graphs, with fast search algorithms. But, in most cases, their performance drastically decreases