Document Representation and Dimension Reduction for Text Clustering

Mahdi Shafiei, Singer Wang, Roger Zhang, Evangelos Milios, Bin Tang, Jane Tougas, Ray Spiteri
Faculty of Computer Science, Dalhousie University, Halifax, Canada
http://www.cs.dal.ca/shafiei

Abstract

Increasingly large text datasets and the high dimensionality associated with natural language create a great challenge in text mining. In this research, a systematic study is conducted in which three different document representation methods for text are used, together with three dimension reduction techniques (DRTs), in the context of the text clustering problem. Several standard benchmark datasets are used. The three document representation methods considered are based on the vector space model, and they include word, multi-word term, and character N-gram representations. The dimension reduction methods are independent component analysis (ICA), latent semantic indexing (LSI), and a feature selection technique based on document frequency (DF). Results are compared in terms of clustering performance, using the k-means clustering algorithm. Experiments show that ICA and LSI are clearly better than DF on all datasets. For the word and N-gram representations, ICA generally gives better results than LSI. Experiments also show that the word representation gives better clustering results than the term and N-gram representations. Finally, for the N-gram representation, it is demonstrated that a profile length (before dimensionality reduction) of 2000 is sufficient to capture the information and that, in most cases, a 4-gram representation gives better performance than a 3-gram representation.

1 Introduction

Advances in information and communication technologies offer ubiquitous access to vast amounts of information and are causing an exponential increase in the number of documents available online.
While more and more textual information is available electronically, effective retrieval and mining are becoming increasingly difficult without the efficient organization, summarization, and indexing of document content. Among the different approaches used to tackle this problem, document clustering is a principal one. In general, given a document collection, the task of text clustering is to group documents together in such a way that the documents within each cluster are similar to each other.

The traditional representation of documents, known as bag-of-words, considers every document as a vector in a very high-dimensional space; each element of this vector corresponds to one word (or, more generally, feature) in the document collection. This representation is based on the Vector Space Model [17], where vector components represent certain feature weights. Among clustering algorithms applied to the vector space representation of documents, bisecting k-means and regular k-means have been found to outperform other clustering methods, while being significantly more efficient computationally, an important consideration with large datasets of high dimensionality [20].

The traditional document representation considers unique words as the components of vectors. Another approach uses N-grams as the vector components. An N-gram is a sequence of symbols extracted from a long string [2]; the symbols can be bytes, characters, or words. Extracting character N-grams from a document involves moving an n-character wide window across the document, character by character. The character N-gram representation has the advantage of being more robust and less sensitive to grammatical and typographical errors, and it requires no linguistic preparation, making it more language independent than other representations.
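The sliding-window extraction described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function names are ours, and a realistic profile would also normalize case and whitespace before counting. The truncation step mirrors the profile-length idea mentioned in the abstract (the experiments there use profiles of up to 2000 N-grams).

```python
from collections import Counter

def char_ngrams(text, n):
    """Slide an n-character window across the text, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_profile(text, n, profile_length):
    """Truncated frequency profile: keep only the most frequent n-grams."""
    return Counter(char_ngrams(text, n)).most_common(profile_length)

# Example: the character 3-grams of "cluster"
print(char_ngrams("cluster", 3))  # ['clu', 'lus', 'ust', 'ste', 'ter']
```

Because the window advances one character at a time, a document of length L yields L - n + 1 overlapping n-grams, which is why this representation degrades gracefully under typographical errors: a single misspelled character corrupts at most n of them.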
Another approach to representing text documents uses multi-word terms as vector components; terms are noun phrases extracted using a combination of linguistic and statistical criteria. This representation is motivated by the notion that terms should carry more semantic information than individual words. Another advantage of using terms to represent a document is the lower dimensionality compared with the traditional word or N-gram representation.

Using any one of these representations, it is not surprising to find thousands or tens of thousands of different words, N-grams, or terms even for a relatively small text collection of a few thousand documents, of which only a very small subset appears in any individual document. This

Mahdi Shafiei, Singer Wang, Roger Zhang, Evangelos Milios, Bin Tang, Jane Tougas, Ray Spiteri: ``Document Representation and Dimension Reduction for Text Clustering'', Workshop on Text Data Mining and Management (TDMM), in conjunction with the 23rd IEEE ICDE Conference, April 15, 2007, Istanbul, Turkey.