T. Salakoski et al. (Eds.): FinTAL 2006, LNAI 4139, pp. 257–267, 2006. © Springer-Verlag Berlin Heidelberg 2006

Document Clustering Based on Maximal Frequent Sequences

Edith Hernández-Reyes, Rene A. García-Hernández, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trinidad

National Institute for Astrophysics, Optics and Electronics
Luis Enrique Erro No.1 Sta. Ma. Tonantzintla, Puebla, México C.P. 72840
{ereyes, renearnulfo, ariel, fmartine}@inaoep.mx

Abstract. Document clustering aims to discover groups of similar documents. The success of document clustering algorithms depends on the model used to represent the documents. Documents are commonly represented with the vector space model based on words or n-grams. However, these representations have some disadvantages, such as high dimensionality and loss of the sequential order of the words. In this work, we propose a new document representation in which the maximal frequent sequences of words are used as features of the vector space model. The efficiency of the proposed model is evaluated by clustering different document collections and is compared against the vector space model based on words and n-grams, through internal and external measures.

1 Introduction

Document clustering is an important technique widely used in text mining and information retrieval systems [1]. Document clustering was proposed to increase the precision and recall of information retrieval systems. Recently, it has been used for browsing documents and generating hierarchies [2].

Document clustering consists of dividing a set of documents into groups. In a language-independent framework, the most common document representation is the vector space model based on words, proposed by Salton in 1975 [3]. Here, every document is represented as a vector of features, where the features correspond to the different words of the document collection. Many works use the vector space model based on words as their document representation [4] [5] [6].
However, one disadvantage of the vector space model based on words is its high dimensionality, because a document collection may contain a huge number of distinct words. For example, the well-known Reuters-21578 [7] document collection is not considered a large collection, yet it contains around 38 thousand different words among the 1.4 million words used in the whole collection. Consequently, several studies have tried to reduce the dimensionality of the vector space model based on words. Another drawback of this representation is that it does not preserve the original order of the words. For example, documents like “This text is concerned about find gold mining” and “Text mining is concerned about find gold text” are treated as identical in this model, because both are represented with the same words, without considering combinations of terms that appear in the documents, such as “text mining” and “gold mining”, which could help to distinguish them.
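
The loss of word order can be illustrated with a minimal Python sketch. It builds term-frequency vectors for the two example sentences and compares them with their sets of adjacent word pairs (bigrams). The whitespace tokenizer, the raw term-frequency weighting, and the helper names `tokenize`, `word_vector`, and `bigrams` are assumptions made for illustration, not part of the proposed method.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def word_vector(tokens, vocabulary):
    """Term-frequency vector over a fixed, ordered vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def bigrams(tokens):
    """Set of adjacent word pairs, which keeps local word order."""
    return set(zip(tokens, tokens[1:]))


doc_a = tokenize("This text is concerned about find gold mining")
doc_b = tokenize("Text mining is concerned about find gold text")

# One feature per distinct word in the (tiny) collection.
vocabulary = sorted(set(doc_a) | set(doc_b))
vec_a = word_vector(doc_a, vocabulary)
vec_b = word_vector(doc_b, vocabulary)

# Word pairs that occur in only one of the two documents.
only_a = bigrams(doc_a) - bigrams(doc_b)
only_b = bigrams(doc_b) - bigrams(doc_a)

print(vocabulary)
print(vec_a, vec_b)
print(sorted(only_a))
print(sorted(only_b))
```

With this tokenizer, every word of the second sentence also occurs in the first, so the word vectors differ only in the entries for “this” and “text”. The bigram sets, in contrast, separate the two documents: “gold mining” appears only in the first and “text mining” only in the second.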