Fast k-NN Classifier for Documents Based on a Graph Structure

Fernando José Artigas-Fuentes¹, Reynaldo Gil-García¹, José Manuel Badía-Contelles², and Aurora Pons-Porrata¹

¹ Center of Pattern Recognition and Data Mining, Universidad de Oriente, Santiago de Cuba, Cuba
{artigas,gil,aurora}@csd.uo.edu.cu
² Computer Science and Engineering Department, Universitat Jaume I, Castelló, Spain
badia@uji.es

Abstract. In this paper, a fast k nearest neighbors (k-NN) classifier for documents is presented. Documents are usually represented in a high-dimensional feature space, where their terms are treated as features and the weight of each term reflects its importance in the document. Many approaches exist for finding the vicinity of an object, but their performance decreases drastically as the number of dimensions grows, which prevents their application to documents. The proposed method is based on a graph index structure with a fast search algorithm. Its high selectivity yields a classification quality similar to that of the exhaustive classifier while computing only a small number of distances. Our experimental results show that our method can be applied to problems of very high dimensionality, such as Text Mining.

Keywords: nearest neighbor classifier, fast nearest neighbor search, text documents.

1 Introduction

Text classification is the task of assigning documents to one or more predefined classes. This task relies on the availability of an initial set of text documents classified under these classes (known as training data). Classification falls at the crossroads of information retrieval, pattern recognition, and data mining, all of which involve very large data sets. Moreover, the dimensionality of text documents is usually large. Therefore, it is crucial to design algorithms that scale well with the dimensionality.
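To make the document representation and the exhaustive classifier mentioned in the abstract concrete, the following is a minimal sketch, not the method proposed in this paper. It assumes raw term frequency as the term weight, cosine similarity as the proximity measure, and a simple majority vote; none of these choices is fixed by the text above, and the names `vectorize`, `cosine`, and `knn_classify`, as well as the toy training set, are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Map a document to a sparse term -> weight dict.

    Assumption: the weight is the raw term frequency.
    """
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query, training, k=3):
    """Exhaustive k-NN: the query is compared against every training document."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), label) for doc, label in training]
    # Most similar documents first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Majority vote among the k most similar training documents.
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training data: (document, class) pairs.
training = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the player signed for a new team", "sports"),
    ("the market fell as stock prices dropped", "finance"),
    ("the bank raised interest rates", "finance"),
    ("investors sold stock after the report", "finance"),
]

label = knn_classify("the team scored a goal in the match", training, k=3)
```

Every query in this sketch costs one similarity computation per training document, and each vector has one dimension per distinct term in the collection. This is the cost that an index structure with a fast search algorithm aims to reduce.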
The k nearest neighbor (k-NN) classifier is a very simple and popular approach to classification [1], but it requires an exhaustive computation of the distances to all training objects. Several methods have been proposed to avoid this problem. One approach improves the access methods by combining appropriate index structures, such as trees or graphs, with fast search algorithms. But, in most cases, their performance drastically decreases
I. Bloch and R.M. Cesar, Jr. (Eds.): CIARP 2010, LNCS 6419, pp. 228–235, 2010. © Springer-Verlag Berlin Heidelberg 2010