International Journal of Computer Applications (0975 – 8887)
Volume 46 – No. 12, May 2012

Efficient Clustering Approach using Statistical Method of Expectation-Maximization

P. Srinivasa Rao, MVGRCE, Vizianagaram
K. Sivarama Krishna, T.R.R. Engg. College, Hyderabad
Nagesh Vadaparthi, MVGRCE, Vizianagaram
S. Vani Kumari, GMRIT, Rajam

ABSTRACT
Clustering is the activity of grouping objects in a dataset based on some measure of similarity. The literature on clustering presents several algorithms for obtaining effective clusters. Among the existing techniques, hierarchical clustering is one of the most widely preferred, and K-Means stands out among them. However, the K-Means algorithm has a number of limitations, such as the need to initialize its parameters. To overcome this limitation, we propose the use of the Expectation-Maximization (E-M) algorithm. K-Means is implemented using the cosine similarity measure, and E-M is applied with a Gaussian Mixture Model. The proposed method has two steps. In the first step, the K-Means and E-M methods are combined to partition the input dataset into several smaller sub-clusters. In the second step, the sub-clusters are merged iteratively based on a maximized Gaussian measure.

Key Terms
K-Means, Expectation-Maximization, Gaussian Mixture Model, clustering, similarity measure.

1. INTRODUCTION
It is very common to differentiate one object from another by some similarity or dissimilarity. Similar objects are grouped together to form clusters. Forming these clusters automatically for a large dataset requires a clustering algorithm [6]. Although numerous algorithms have emerged with additional features and capabilities, K-Means [3], with its widely appreciated simplicity, understandability, and scalability, stands out among them.
With the clustering algorithm, it is equally important to choose the similarity measure used to compare objects. Many analysts strongly recommend a well-performing measure such as cosine similarity [2]. Although several other similarity measures perform roughly on par with cosine similarity [9], clustering is illustrated here using the cosine measure.

In this paper, we propose an iterative method of clustering called Expectation-Maximization (E-M) [4], which is defined in two steps. The expectation step calculates the similarity values between two objects; the maximization step maximizes those similarity values to obtain the final clusters. Our analysis focuses on parameterless K-Means with E-M, which is briefly described in Section 2. The process of clustering and the related theory are explained in Section 3. Section 4 gives the conclusion and Section 5 the future work.

2. PARAMETERLESS K-MEANS WITH E-M

2.1. Initial K-Means
The classical K-Means approach finds k clusters of n observations such that each observation belongs to the cluster with the nearest mean. Given a set of n observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, K-Means clustering aims to partition the n observations into k sets (k ≤ n), S = {S1, S2, ..., Sk}, so as to minimize the within-cluster sum of squares (WCSS):

    arg min_S  Σ_{i=1}^{k}  Σ_{x ∈ S_i}  ‖x − μ_i‖²        (1)

where μ_i is the mean of the points in S_i. The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
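The steps above can be sketched as a short Python routine. This is only an illustrative sketch under our own assumptions (NumPy, initial centroids drawn at random from the data, assignment by highest cosine similarity in the spirit of [2]); names such as `kmeans_cosine` are hypothetical and not from the paper.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def kmeans_cosine(X, k, max_iter=100, seed=0):
    """K-Means where each object is assigned to the centroid with the
    HIGHEST cosine similarity (instead of the lowest Euclidean distance),
    and centroids are recomputed as the mean of the assigned points."""
    rng = np.random.default_rng(seed)
    # Step 1: place K points into the object space as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the most similar centroid
        labels = np.array([
            max(range(k), key=lambda j: cosine_sim(x, centroids[j]))
            for x in X
        ])
        # Step 3: recalculate centroid positions (keep old one if a
        # cluster happens to be empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Because assignment maximizes cosine similarity rather than minimizing distance, this variant groups vectors by direction, which matches the similarity measure discussed above.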
2.2. Expectation-Maximization
E-M alternates between two steps. The expectation step finds the values of similarity between objects and the model components; the maximization step maximizes the likelihood of the parameters given those similarities. The model consists of a set of observed data X, a set of unobserved latent data Z, and a vector of unknown parameters θ, along with a likelihood function

    L(θ; X, Z) = p(X, Z | θ)        (2)

The maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data.
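To make the E- and M-steps concrete, the following is a minimal one-dimensional Gaussian-mixture sketch. It is our own illustration, not the authors' implementation; the quantile-based initialization and the name `em_gmm_1d` are assumptions introduced for the example.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50):
    """E-M for a one-dimensional Gaussian mixture model.

    E-step: compute the responsibility r[i, j] = P(component j | x_i),
    i.e. the 'similarity' of each point to each Gaussian component.
    M-step: re-estimate mixing weights, means, and variances so as to
    maximize the expected complete-data log-likelihood.
    """
    n = len(x)
    w = np.full(k, 1.0 / k)                    # mixing weights
    mu = np.quantile(x, np.linspace(0, 1, k))  # spread initial means over the data
    var = np.full(k, np.var(x))                # start from the overall variance
    for _ in range(n_iter):
        # E-step: Gaussian density of every point under every component
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = w * dens
        r /= r.sum(axis=1, keepdims=True)      # normalize to responsibilities
        # M-step: responsibility-weighted parameter re-estimation
        nk = r.sum(axis=0)                     # effective cluster sizes
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```

Each iteration increases the marginal likelihood of the observed data (or leaves it unchanged), which is the guarantee that motivates the MLE formulation above.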