International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3637
Optimal Number of Cluster Identification using Robust K-means for
Sequences in Categorical Sequences
S.U. Patil¹, U.A. Nuli²
¹,² Computer Science and Engineering Department, M. Tech, Textile and Engineering Institute, Ichalkaranji,
Maharashtra, India
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract - This paper presents a modified k-means
algorithm for clustering. In traditional clustering
methods, the number of clusters to be formed must be
given at the start of the algorithm, and this choice affects
the performance and efficiency of the algorithm. In Robust
K-means for sequences, the optimal number of clusters is
predicted by removing noise clusters. Cluster validation,
which is the process of evaluating the quality of clustering
results, plays an important role in practical machine
learning systems. Categorical sequences, such as biological
sequences in computational biology, have become
common in real-world applications. Unlike previous
studies, which mainly focused on attribute-value data,
this paper addresses the cluster validation problem for
categorical sequences. Clustering is defined as
unsupervised learning in which objects are grouped on
the basis of some similarity inherent among them. The
intention of this paper is to describe a clustering method
that gives the optimal number of clusters in categorical
sequences.
Key Words: Clustering, K-means, cluster validation index,
categorical sequences, centroid.
1. INTRODUCTION
Data mining is the process of deriving required information
from a large collection of data and analyzing it. A data
mining technique extracts information from a bulky data set
and transforms it into an understandable form for further
use. In generic terms, this is known as classification. When
the class of each object is given in advance, the task is
termed supervised classification, whereas when no class
label is attached to an object in advance, it is termed
unsupervised classification. Unsupervised classification is
commonly known as clustering. Clustering is an important
analysis technique that is applied to large datasets and finds
application in fields such as search engines,
recommendation systems, data mining, knowledge
discovery, bioinformatics and documentation. Nowadays, the
data being generated is not only huge in volume but is also
stored across various machines all around the world. The
main purpose behind the study of classification is to develop
a tool or an algorithm that can be used to predict the class
of an unknown, unlabeled object.
The clustering problem cannot be solved by one specific
algorithm; it requires various algorithms that differ
significantly in their notion of what constitutes a cluster and
how to find clusters efficiently. Common notions of a cluster
include groups with small distances among the cluster
members, dense areas of the data space, intervals, or
particular statistical distributions. Clustering can therefore
be formulated as a multi-objective optimization problem. The
appropriate clustering algorithm and parameter settings
depend on the individual data set and the intended use of the
results. Clustering is considered more difficult than
supervised classification because no label is attached to the
patterns. In supervised classification, the given label serves
as a clue for grouping data objects, whereas in clustering, in
the absence of a label, it is difficult to decide which group a
pattern belongs to.
The problem of clustering becomes more
challenging when the data is categorical, that is, when there
is no inherent distance measure between data values. This is
often the case in many domains where data is described by a
set of descriptive attributes, many of which are neither
numerical nor inherently ordered in any way. Moreover,
clustering categorical sequences is challenging for one
more reason: the difficulty of defining an inherently
meaningful measure of similarity between sequences.
Cluster validation, which is the process of evaluating the
quality of clustering results, plays an important role in
practical machine learning systems.
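
To make the lack of an inherent distance concrete, the following minimal sketch contrasts two commonly used workarounds. Both are assumptions chosen for illustration and are not the measures used in this paper: a simple matching dissimilarity for fixed-length attribute-value records, and a cosine similarity over k-mer (length-k substring) counts for variable-length categorical sequences. The function names and the default k = 2 are hypothetical.

```python
from collections import Counter
from math import sqrt


def simple_matching_dissimilarity(a, b):
    """Fraction of attributes on which two equal-length categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def kmer_counts(seq, k=2):
    """Count the overlapping substrings of length k (k-mers) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_cosine_similarity(s, t, k=2):
    """Cosine similarity between the k-mer count vectors of two sequences.

    Illustrative assumption only: sequences of different lengths can be
    compared, but the ordering of symbols beyond length k is ignored.
    """
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    dot = sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())
    norm_s = sqrt(sum(v * v for v in cs.values()))
    norm_t = sqrt(sum(v * v for v in ct.values()))
    if norm_s == 0 or norm_t == 0:
        return 0.0  # a sequence shorter than k has no k-mers
    return dot / (norm_s * norm_t)


if __name__ == "__main__":
    # Attribute-value records: only equality of values can be tested.
    print(simple_matching_dissimilarity(("red", "small", "round"),
                                        ("red", "large", "round")))
    # Sequences of different lengths: compared through shared 2-mers.
    print(kmer_cosine_similarity("ACGTAC", "ACGT"))
```

The first measure only tests whether values are equal, which is all that unordered categories allow. The second shows one way to compare sequences of unequal length, at the cost of discarding most of the ordering information, which is the structural content that a validation criterion for sequences would need to capture.
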
Categorical sequences, such as biological sequences in
computational biology, have become common in real-world
applications. Without a measure of distance between data
values, it is unclear how to define a quality measure for
categorical clustering. To define such a measure, we employ
mutual information, a measure from information theory. A
good clustering is one where the clusters are informative
about the data objects they contain. Since data objects are
expressed in terms of attribute values, we require that the
clusters convey information about the attribute values of the
objects in the cluster. The evaluation of sequence
clustering is currently difficult due to the lack of an internal
validation criterion defined with regard to the structural
features hidden in sequences. To solve this problem, a novel
cluster validity index (CVI) is proposed as a function of
clustering, with the intra-cluster structural compactness and
inter-cluster structural separation linearly combined to
measure the quality of sequence clusters. Cluster validation,