In Proceedings of the 1st Information Retrieval Symposium (AIRS), 2004.

Document Clustering using Linear Partitioning Hyperplanes and Reallocation

Canasai Kruengkrai, Virach Sornlertlamvanich, Hitoshi Isahara
Thai Computational Linguistics Laboratory
National Institute of Information and Communications Technology
112 Paholyothin Road, Klong 1, Klong Luang, Pathumthani 12120, Thailand
{canasai,virach}@tcllab.org, isahara@nict.go.jp

ABSTRACT

This paper presents a novel algorithm for document clustering based on a combinatorial framework of the Principal Direction Divisive Partitioning (PDDP) algorithm [1] and a simplified version of the EM algorithm called the spherical Gaussian EM (sGEM) algorithm. The idea of the PDDP algorithm is to recursively split data samples into two sub-clusters using the hyperplane normal to the principal direction derived from the covariance matrix. However, the PDDP algorithm can yield poor results, especially when clusters are not well-separated from one another. To improve the quality of the clustering results, we deal with this problem by re-allocating new cluster membership using the sGEM algorithm with different settings. Furthermore, based on the theoretical background of the sGEM algorithm, we can naturally extend the framework to cover the problem of estimating the number of clusters using the Bayesian Information Criterion. Experimental results on two different corpora are given to show the effectiveness of our algorithm.

1. INTRODUCTION

Unsupervised clustering has been applied to various tasks in the field of Information Retrieval (IR). One of the challenging problems is document clustering, which attempts to discover meaningful groups of documents where those within each group are more closely related to one another than documents assigned to different groups. The resulting document clusters can provide a structure for organizing large bodies of text for efficient browsing and searching [15].
A wide variety of unsupervised clustering algorithms has been intensively studied for the document clustering problem. Among these, the iterative optimization clustering algorithms have demonstrated reasonable performance for document clustering, e.g. the Expectation-Maximization (EM) algorithm and its variants, and the well-known k-means algorithm. In fact, the k-means algorithm can be considered a special case of the EM algorithm [3] by assuming that each cluster is modeled by a spherical Gaussian, each sample is assigned to a single cluster, and all mixing parameters (or prior probabilities) are equal. The competitive advantage of the EM algorithm is that it is fast, scalable, and easy to implement. However, one major drawback is that it often gets stuck in local optima, depending on the initial random partitioning. Several techniques have been proposed for finding good starting clusters (see [3][7]).

Recently, Boley [1] has developed a hierarchical clustering algorithm called the Principal Direction Divisive Partitioning (PDDP) algorithm that works by recursively splitting data samples into two sub-clusters. The PDDP algorithm has several interesting properties. It applies the concept of Principal Component Analysis (PCA) but requires only the principal eigenvector, which is not computationally expensive. It can also generate a hierarchical binary tree that inherently produces a simple taxonomic ontology. Clustering results produced by the PDDP algorithm compare favorably to other document clustering approaches, such as the agglomerative hierarchical algorithm and association rule hypergraph clustering. However, the PDDP algorithm can yield poor results, especially when clusters are not well-separated from one another. This problem will be described in depth later. In this paper, we propose a novel algorithm for document clustering based on a combinatorial framework of the PDDP algorithm and a variant of the EM algorithm.
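To make the PDDP splitting step concrete, the following sketch (our own illustration, not code from the paper; the function name `pddp_split` is ours) performs a single split: it projects the mean-centered document vectors onto the principal direction, obtained here as the leading right singular vector of the centered matrix, and cuts at the hyperplane through the centroid.

```python
import numpy as np

def pddp_split(X):
    """One PDDP split: project documents onto the principal direction
    of the mean-centered data and cut at zero.

    X: (n_docs, n_terms) matrix of document vectors.
    Returns boolean masks for the two sub-clusters.
    """
    centroid = X.mean(axis=0)
    # Principal direction = leading right singular vector of the centered
    # matrix (equivalently, the principal eigenvector of the covariance
    # matrix); only this one vector is needed, as the paper notes.
    _, _, vt = np.linalg.svd(X - centroid, full_matrices=False)
    u = vt[0]
    # The splitting hyperplane passes through the centroid, normal to u:
    # documents with non-negative projection form one sub-cluster.
    proj = (X - centroid) @ u
    return proj >= 0, proj < 0

# Tiny usage example: two well-separated groups of 2-d points.
X = np.array([[9.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 8.0]])
left, right = pddp_split(X)
```

Recursing on each sub-cluster (choosing which cluster to split next, e.g. by scatter) yields the binary tree described above.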
As discussed above, each algorithm has its own strengths and weaknesses. We are interested in the idea of the PDDP algorithm that uses PCA to analyze the data. More specifically, it splits the data samples into two sub-clusters based on the hyperplane normal to the principal direction derived from the covariance matrix of the data. When the principal direction is not representative, the corresponding hyperplane tends to produce individual clusters with wrongly partitioned contents. One practical way to deal with this problem is to run the EM algorithm on the partitioning results. We present a simplified version of the EM algorithm called the spherical Gaussian EM algorithm to perform this task. Furthermore, based on the theoretical background of the spherical Gaussian EM algorithm, we can naturally extend the framework to cover the problem of estimating the number of clusters using the Bayesian Information Criterion (BIC) [9].

The rest of this paper is organized as follows. Section 2 briefly reviews the important background of the PDDP algorithm, and addresses the problem that causes incorrect partitioning. Section 3 presents the spherical Gaussian EM algorithm, and describes how to combine it with the PDDP algorithm. Section 4 discusses the idea of applying the BIC to our algorithm. Section 5 explains the data sets and the evaluation method, and shows experimental results. Finally, we conclude in Section 6 with some directions for future work.
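As a rough illustration of BIC-based model selection under a spherical Gaussian assumption, the sketch below scores a clustering by its log-likelihood minus a complexity penalty. This is a generic X-means-style instantiation written by us for illustration; the paper's exact sGEM-based formulation (parameter count, variance estimate) may differ, and the function name `spherical_bic` is ours.

```python
import numpy as np

def spherical_bic(X, labels, k):
    """BIC score for a clustering under a mixture of spherical Gaussians
    with a shared variance. Higher is better; comparing scores across
    different k gives an estimate of the number of clusters.
    """
    n, d = X.shape
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Pooled maximum-likelihood estimate of the shared spherical variance.
    sq_err = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    var = sq_err / (n * d)
    sizes = np.array([(labels == j).sum() for j in range(k)])
    # Log-likelihood of the data under the fitted spherical mixture.
    loglik = (sizes * np.log(sizes / n)).sum() \
        - 0.5 * n * d * np.log(2 * np.pi * var) - 0.5 * n * d
    # Free parameters: (k-1) mixing weights, k*d means, one variance.
    p = (k - 1) + k * d + 1
    return loglik - 0.5 * p * np.log(n)
```

Given PDDP/sGEM partitionings at different values of k, the k whose partition maximizes this score is selected; on clearly bimodal data, for instance, the k = 2 partition scores above the k = 1 partition.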