CLUSTER ANALYSIS – AN OVERVIEW
ANURADHA BHATIA¹ & GAURAV VASWANI²
¹Faculty, Department of Computer, VES Polytechnic, Mumbai, Maharashtra, India
²Student, Department of Computer Technology, VESIT, Mumbai, Maharashtra, India
ABSTRACT
Clustering analysis, also called segmentation analysis or taxonomy analysis, aims to partition objects into a set of
homogeneous groups, called clusters, according to given criteria. Clustering is a very important technique of knowledge
discovery for human beings. It has a long history and can be traced back to the time of Aristotle. These days, cluster
analysis is mainly conducted on computers to deal with very large-scale and complex datasets. With the development of
computer-based techniques, clustering has been widely used in data mining, in areas ranging from web mining, image processing,
machine learning, artificial intelligence, pattern recognition, social network analysis, bioinformatics, geography, geology,
biology, psychology, sociology, customer behaviour analysis, and marketing to e-business and other fields.
KEYWORDS: Cluster Analysis, K-Means, Hierarchical, Genes, Microdata, Problems
INTRODUCTION
The clustering of large datasets in data mining is an iterative process involving humans. Thus, the user’s
initial estimate of the number of clusters is important for choosing the parameters of clustering algorithms in the
pre-processing stage of clustering. Also, the user’s clear understanding of the cluster distribution is helpful for assessing the
quality of clustering results in the post-processing stage. All of these rely heavily on the user’s visual perception of the
data distribution. Clearly, visualization is a crucial aspect of cluster exploration and verification in cluster analysis. Visual
presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data.
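To make the role of the user's estimate of the cluster number concrete, the following is a minimal sketch of k-means (the method named in the keywords) in plain Python. It is an illustration under stated assumptions, not the procedure of any particular system: the sample points, the helper names (sq_dist, farthest_first, assign, kmeans, within_cluster_ss) and the deterministic farthest-first initialisation are all chosen here for the example. The sketch runs the algorithm for several values of k and prints the within-cluster sum of squares, a quantity a user might inspect alongside a plot when judging whether the chosen k fits the data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic initialisation (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from its nearest already-chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def assign(points, centroids):
    """Assignment step: put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, max_iterations=100):
    """Alternate assignment and centroid updates until the centroids stop moving."""
    centroids = farthest_first(points, k)
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

def within_cluster_ss(centroids, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, c) for c, members in zip(centroids, clusters) for p in members)

# Hypothetical sample data: three visually separate groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (9.0, 1.0), (9.1, 1.2), (8.9, 0.9)]

for k in (1, 2, 3, 4):
    centroids, clusters = kmeans(points, k)
    print(k, [len(c) for c in clusters], round(within_cluster_ss(centroids, clusters), 4))
```

On this sample the within-cluster sum of squares falls sharply up to k = 3 and only marginally afterwards, which matches what a scatter plot of the points would suggest to the eye. On real data the choice is rarely this clear, which is why visual inspection remains useful.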
Cluster analysis divides data into meaningful or useful groups (clusters). If meaningful clusters are the goal, then
the resulting clusters should capture the “natural” structure of the data. For example, cluster analysis has been used to
group related documents for browsing, to find genes and proteins that have similar functionality, and to provide a grouping
of spatial locations prone to earthquakes. However, in other cases, cluster analysis is only a useful starting point for other
purposes, e.g., data compression or efficiently finding the nearest neighbours of points. Whether for understanding or
utility, cluster analysis has long been used in a wide variety of fields: psychology and other social sciences, biology,
statistics, pattern recognition, information retrieval, machine learning, and data mining.
Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine
learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large
datasets with very many attributes of different types. This imposes unique computational requirements on relevant
clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and have been successfully
applied to real-life data mining problems. They are the subject of this survey.
Cluster analysis, like factor analysis, makes no distinction between dependent and independent variables.
The entire set of interdependent relationships is examined. Cluster analysis is the obverse of factor analysis. Whereas
International Journal of Computer Science Engineering
and Information Technology Research (IJCSEITR)
ISSN 2249-6831
Vol. 3, Issue 4, Oct 2013, 143-150
© TJPRC Pvt. Ltd.