Rich Document Representation for Document Clustering

Azam Jalali
Department of Computer and Electrical Engineering, Faculty of Engineering, University of Tehran
Jalali@itrc.ac.ir

Farhad Oroumchian
University of Wollongong in Dubai, P.O. Box 20183, Dubai, UAE
FarhadOroumchian@uowdubai.ac.ae

Abstract

In traditional document clustering models, a document is treated as a bag of words. In this paper we present a new method for generating feature vectors that uses the sentence fragments called logical terms and statements in the PLIR system. PLIR is a knowledge-based information system based on the theory of Plausible Reasoning. We have conducted a number of experiments using the OHSUMED document collection and the K-Means clustering method with seven different similarity measures between documents. The experiments seem to indicate that richer features such as logical terms or statements tend to perform better for clustering than the simple bag-of-words approach within our domain of experiments, which is the second phase of a two-stage retrieval system.

1 Introduction

Document clustering, the process of finding natural groupings in documents, is an important task in information retrieval. The cluster hypothesis states that relevant documents tend to be more similar to each other than to non-relevant documents, and therefore tend to appear in the same clusters. There has been general research on how to utilize clustering to improve retrieval results. In most of the previous attempts the strategy was to build a static clustering of the entire collection and then match the query to the cluster centroids. Many researchers have examined the effectiveness of hierarchic clustering methods and have compared them to conventional inverted file search (Croft, 1980). More recently, clustering has been investigated as a methodology for improving the search and browsing of retrieved documents (Tombros 2002; Cutting et al. 1992).
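The static strategy mentioned above, in which the collection is clustered once and a query is then matched against the cluster centroids, can be sketched as follows. This is a minimal illustration under stated assumptions, not the method evaluated in this paper: the toy term-frequency vectors, the fixed cluster assignment, the choice of cosine similarity, and the helper names `cosine`, `centroid`, and `best_cluster` are all introduced only for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def best_cluster(query, clusters):
    """Return the id of the cluster whose centroid is most similar to the query."""
    centroids = {cid: centroid(docs) for cid, docs in clusters.items()}
    return max(centroids, key=lambda cid: cosine(query, centroids[cid]))


# Hypothetical toy data: term-frequency vectors over a 4-term vocabulary,
# already grouped into two clusters (the grouping itself is assumed).
clusters = {
    "A": [[3, 1, 0, 0], [2, 2, 0, 0]],
    "B": [[0, 0, 4, 1], [0, 1, 3, 2]],
}
query = [0, 0, 1, 1]

# The whole cluster is selected as a unit; its documents are not ranked.
retrieved = best_cluster(query, clusters)
```

Cosine similarity is used here only because it is a common default for term vectors; any other document similarity measure could be substituted in `best_cluster` without changing the overall structure.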
In cluster-based search, a single cluster is retrieved in response to a query. The documents within the retrieved cluster are not ranked in relation to the query; rather, the whole cluster is retrieved as an entity. Cluster representation refers to the formation of cluster representatives, or centroids, that attempt to summarize the contents of a cluster for the purpose of retrieving the cluster. Incoming queries are matched against the representatives, and the cluster whose representative is most similar to the query is retrieved (Tombros 2002). Three different types of cluster-based search have been studied in IR: top-down search, bottom-up search and optimal cluster search.

The cluster-based browsing paradigm clusters documents into topically coherent groups and presents descriptive textual summaries to the user (Cutting et al. 1992). Informed by the summaries, the user may select clusters, forming a sub-collection for interactive examination. The clustering and re-clustering are done on the fly. Here, cluster representation refers to textual or graphical representations of the cluster contents that support the user's judgments on the utility of the cluster contents.

There are many algorithms for automatic clustering, such as partitioning algorithms and hierarchical clustering, that can be applied to a set of vectors to form the clusters. Traditionally, documents are represented as bags of weighted words; the weights could be calculated based on the frequency of the