International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3637

Optimal Number of Cluster Identification using Robust K-means for Sequences in Categorical Sequences

S.U. Patil 1, U.A. Nuli 2
1,2 Computer Science and Engineering department, M. Tech, Textile and Engineering Institute, Ichalkaranji, Maharashtra, India

Abstract - This paper presents a modified k-means algorithm for clustering. In the traditional method of clustering, the number of clusters to be formed is given at the start of the algorithm, which affects the performance and efficiency of the algorithm. In Robust K-means for sequences, the optimal number of clusters is predicted by removing noise clusters. Cluster validation, which is the process of evaluating the quality of clustering results, plays an important role in practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Unlike previous studies, which mainly focused on attribute-value data, this paper addresses the cluster validation problem for categorical sequences. Clustering is defined as unsupervised learning in which objects are grouped on the basis of some similarity inherent among them. The intention of this paper is to describe a clustering method that gives the optimal number of clusters in categorical sequences.

Key Words: Clustering, K-means, cluster validation index, categorical sequences, centroid.

1. INTRODUCTION

Data mining is the process of deriving required data from a large collection of data and analysing the collected data.
The aim of data mining is to extract information from a large data set and transform it into an understandable form for further use. In generic terms, this is known as classification. When the class of each object is given in advance, the task is termed supervised classification, whereas when no class label is attached to an object in advance, it is termed unsupervised classification. Unsupervised classification is commonly known as clustering. Clustering is an important analysis technique that is applied to large datasets and finds application in fields such as search engines, recommendation systems, data mining, knowledge discovery, bioinformatics and documentation. Nowadays, the data being generated is not only huge in volume but is also stored across various machines all around the world. The main purpose behind the study of classification is to develop a tool or an algorithm that can be used to predict the class of an unknown, unlabeled object. The clustering problem cannot be solved by one specific algorithm; it requires various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Generally, clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings depend on the individual data set and the intended use of the results. Clustering is considered more difficult than supervised classification because no label is attached to the patterns in clustering. In supervised classification, the given label serves as a clue for grouping the data objects as a whole, whereas in clustering it is difficult to decide to which group a pattern belongs in the absence of a label.
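As a point of reference for the traditional setting, in which clusters are groups with small distances among their members and the number of clusters must be supplied in advance, the following is a minimal sketch of plain k-means on numeric points. It is not the Robust K-means for sequences proposed in this paper. The function name `kmeans`, the parameter `iterations`, the choice of the first k points as initial centroids, and the sample points are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, k, iterations=20):
    """Plain k-means on numeric points; the caller must choose k up front."""
    # Assumption: the first k points serve as initial centroids (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels, centroids


# Hypothetical data: two visibly separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The result depends directly on the value of `k` passed in, which is the limitation of the traditional method noted in the abstract. The sketch also relies on a numeric distance, which is not available for categorical data.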
The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. This is often the case in many domains, where data is described by a set of descriptive attributes, many of which are neither numerical nor inherently ordered in any way. Moreover, clustering categorical sequences is challenging for one more reason: the difficulty of defining an inherently meaningful measure of similarity between sequences. Cluster validation, which is the process of evaluating the quality of clustering results, plays an important role in practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Without a measure of distance between data values, it is unclear how to define a quality measure for categorical clustering. To do this, we employ mutual information, a measure from information theory. A good clustering is one where the clusters are informative about the data objects they contain. Since data objects are expressed in terms of attribute values, we require that the clusters convey information about the attribute values of the objects in the cluster. The evaluation of sequence clustering is currently difficult due to the lack of an internal validation criterion defined with regard to the structural features hidden in sequences. To solve this problem, a novel cluster validity index (CVI) is proposed as a function of the clustering, in which intra-cluster structural compactness and inter-cluster structural separation are linearly combined to measure the quality of sequence clusters. Cluster validation,