T. Salakoski et al. (Eds.): FinTAL 2006, LNAI 4139, pp. 257 – 267, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Document Clustering Based on Maximal Frequent
Sequences
Edith Hernández-Reyes, Rene A. García-Hernández, J.A. Carrasco-Ochoa,
and J.Fco. Martínez-Trinidad
National Institute for Astrophysics, Optics and Electronics
Luis Enrique Erro No.1 Sta. Ma. Tonantzintla, Puebla, México C.P. 72840
{ereyes, renearnulfo, ariel, fmartine}@inaoep.mx
Abstract. Document clustering aims to discover groups of similar documents.
The success of document clustering algorithms depends on the model used to
represent the documents. Documents are commonly represented with the vector
space model based on words or n-grams. However, these representations have
disadvantages such as high dimensionality and loss of the sequential order of
words. In this work, we propose a new document representation in which the
maximal frequent sequences of words are used as features of the vector space
model. The efficiency of the proposed model is evaluated by clustering different
document collections and is compared against the vector space model based on
words and n-grams, using internal and external measures.
1 Introduction
Document clustering is an important technique widely used in text mining and
information retrieval systems [1]. It was proposed to increase the precision and
recall of information retrieval systems. More recently, it has been used for
browsing documents and generating hierarchies [2].
Document clustering consists of dividing a set of documents into groups. In a
language-independent framework, the most common document representation is the
vector space model based on words, proposed by Salton in 1975 [3]. Here, every
document is represented as a vector of features, where the features correspond to
the different words of the document collection. Many works use the vector space
model based on words as the document representation [4] [5] [6]. However, a
disadvantage of this model is its high dimensionality, because a document
collection might contain a huge number of different words. For example, the
well-known Reuters-21578 [7] document collection is not considered a big
collection, yet it contains around 38 thousand different words among the 1.4
million words used in the whole collection. Consequently, several studies have
tried to reduce the dimensionality of the vector space model based on words.
Another drawback of this representation is that it does not preserve the original
order of the words. For example, documents like “This text is concerned about find
gold mining” and “Text mining is concerned about find gold text” are treated as
practically identical in this model, because both are represented with almost
exactly the same words, without considering combinations of terms that appear in
the documents, like “text mining” and “gold mining”, which could help to
distinguish them.
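The loss of word order can be checked directly on the two example sentences. The following is a minimal sketch, not the representation proposed in this paper: it assumes a naive whitespace tokenizer, Boolean (presence or absence) word features, and plain word bigrams as a stand-in for order-preserving features. The function names `tokenize`, `word_features`, and `bigram_features` are hypothetical and introduced only for illustration.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def word_features(text):
    """Boolean bag of words: the set of distinct words, order discarded."""
    return set(tokenize(text))


def bigram_features(text):
    """Set of adjacent word pairs, which keeps local word order."""
    tokens = tokenize(text)
    return set(zip(tokens, tokens[1:]))


doc_a = "This text is concerned about find gold mining"
doc_b = "Text mining is concerned about find gold text"

# Words that occur in only one of the two documents.
word_diff = word_features(doc_a) ^ word_features(doc_b)
print(word_diff)  # {'this'}

# Bigrams that occur in only one of the two documents.
only_a = bigram_features(doc_a) - bigram_features(doc_b)
only_b = bigram_features(doc_b) - bigram_features(doc_a)
print(("gold", "mining") in only_a)  # True
print(("text", "mining") in only_b)  # True
```

Under these assumptions the word-based features of the two sentences differ only in the word "this", whereas the bigram features separate "gold mining" from "text mining". Bigrams are used here only because they are the simplest order-aware feature; they do not reproduce the maximal frequent sequences discussed in this work.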