International Journal of Computer Applications (0975 – 8887) Volume 129 – No. 13, November 2015

Hierarchical Clustering – An Efficient Technique of Data Mining for Handling Voluminous Data

Shuhie Aggarwal
M.Tech (Computer Science & Engg.), KIET, Ghaziabad

Parul Phoghat
M.Tech (Computer Science & Engg.), KIET, Ghaziabad

Seema Maitrey
Department of CSE, KIET, Ghaziabad

ABSTRACT
The objective of data mining is to extract information from large amounts of data and convert it into a form that can be used further. Data mining offers several functionalities, of which clustering is the focus of this paper. Clustering is essentially unsupervised learning: the categories into which the data is to be placed are not known in advance. It is a process of grouping a set of abstract objects such that objects within one cluster are highly similar to each other and dissimilar to objects in other clusters. Clustering can be performed by a number of methods, such as partitioning-based methods, hierarchy-based methods, density-based methods, grid-based methods, model-based methods, and constraint-based clustering. This survey paper reviews clustering and its different techniques, with special focus on hierarchical clustering. A number of recently developed hierarchical clustering methods are described here, with the goal of providing useful references to fundamental concepts accessible to the broad community of clustering practitioners.

Keywords
Data Mining, Clustering Techniques, Hierarchical Clustering, Agglomerative, Divisive

1. INTRODUCTION
Clustering is a process in which data is divided into groups, called clusters, such that objects in one cluster are highly similar to each other while objects in different clusters are highly dissimilar [1][13].

Fig 1: Overview of Clustering

Clustering is useful in pattern analysis, decision-making, and machine-learning situations, including data mining, pattern recognition, document retrieval, and image segmentation.
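Since the paper's focus is hierarchical clustering of the agglomerative (bottom-up) kind named in the keywords, the basic idea can be sketched as follows. This is a minimal illustrative sketch, not an algorithm from the paper: the single-linkage distance, the 1-D sample points, and the target of two clusters are all assumptions chosen for brevity.

```python
# A minimal sketch of agglomerative (bottom-up) hierarchical clustering on
# 1-D points using single linkage. Data and parameters are illustrative.

def single_linkage(c1, c2):
    """Cluster distance = smallest pairwise distance between their points."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points, k):
    """Start with one cluster per point; merge the two closest clusters
    repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    return clusters

data = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
print(agglomerative(data, 2))  # → [[1.0, 1.5, 2.0], [8.0, 8.5, 9.0]]
```

Divisive (top-down) clustering works in the opposite direction: it starts from one all-inclusive cluster and splits it recursively.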
However, in many cases only a little knowledge about the given data is available; it is in such situations that clustering is particularly useful for discovering inter-relationships among data points [2]. Clustering is an important data mining task that divides data into meaningful subsets from which information can be extracted [3]. A cluster is therefore a collection of objects that are highly similar to each other and very dissimilar to objects belonging to other clusters [4]; in other words, inter-cluster similarity is low and intra-cluster similarity is high. There are several clustering algorithms, and the choice of a particular algorithm depends mainly on three factors: data set size, data dimensionality, and time complexity.

2. NEED OF CLUSTERING
Clustering is a very important tool for analyzing large data of wide variety, i.e., multivariate data coming from heterogeneous sources [5][7]. Clustering techniques have been employed across a wide range of scientific areas. Data clustering is used in the following three major areas [6]:

a) Underlying structure – to gain insight into the data, detect anomalies, and identify salient features of the data
b) Natural classification – to identify the degree of similarity among forms
c) Compression – as a method for organizing the data and summarizing it through cluster prototypes

3. REQUIREMENTS OF CLUSTERING ALGORITHMS
Clustering is in itself a challenging field of research, and its potential applications pose their own special requirements.
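The intra/inter cluster criterion above can be made concrete by measuring average distances: distances inside a cluster should be small (high intra-cluster similarity) while distances between clusters should be large (low inter-cluster similarity). The following sketch uses illustrative 1-D data and helper names that are assumptions, not from the paper.

```python
# Sketch of the clustering quality criterion: small average distance within
# a cluster (high intra-cluster similarity), large average distance between
# clusters (low inter-cluster similarity). Sample data is illustrative.
from itertools import combinations, product

def avg(distances):
    ds = list(distances)
    return sum(ds) / len(ds)

def intra_distance(cluster):
    """Average pairwise distance within a single cluster."""
    return avg(abs(a - b) for a, b in combinations(cluster, 2))

def inter_distance(c1, c2):
    """Average distance between points of two different clusters."""
    return avg(abs(a - b) for a, b in product(c1, c2))

c1, c2 = [1.0, 1.5, 2.0], [8.0, 8.5, 9.0]
print(intra_distance(c1))      # small: points in c1 are close together
print(inter_distance(c1, c2))  # large: c1 and c2 are well separated
```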
The main requirements of a clustering algorithm are [5][10]:

(i) Scalability
(ii) Ability to deal with different types of attributes
(iii) Ability to discover clusters of arbitrary shape
(iv) Minimal requirements for domain knowledge to determine input parameters
(v) Ability to deal with noise and outliers
(vi) Insensitivity to the order of input records, i.e., records can be fed in any order
(vii) Ability to handle high dimensionality
(viii) Constraint-based clustering
(ix) Interpretability and usability
(x) Incremental clustering

3.1 Steps of the Clustering Process [12]
(i) Data cleaning and preparing the data set for analysis
(ii) Creating new relevant variables
(iii) Selection of variables
(iv) Variable treatment: outliers and missing values
(v) Variable standardization
(vi) Obtaining the cluster solution
(vii) Checking the optimality of the solution
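Step (v), variable standardization, is worth a concrete illustration: variables measured on very different scales (e.g. income vs. age) would otherwise dominate the distance computations used by clustering. A common approach is z-scoring, sketched below; the sample values are illustrative assumptions.

```python
# Sketch of variable standardization (step v): z-scoring a variable so it
# has mean 0 and standard deviation 1 before distances are computed.
# The income figures below are made-up illustrative data.

def standardize(values):
    """Return z-scores: subtract the mean, divide by the (population) std."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [20000.0, 30000.0, 40000.0]
z = standardize(incomes)
print(z)  # roughly [-1.22, 0.0, 1.22]: now comparable to other variables
```

After all variables are standardized this way, each contributes comparably to the distance measure used when obtaining the cluster solution (step vi).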