International Conference and Workshop on Emerging Trends in Technology (ICWET 2011) – TCET, Mumbai, India

Clustering with Apache Hadoop

S Nair
Computer Engineering Dept
Shah and Anchor Kutchhi Engineering College
Chembur, Mumbai, India
Sindhu_knair@hotmail.com

J Mehta
IT Dept
Shah and Anchor Kutchhi Engineering College
Chembur, Mumbai, India
Jalpa03@yahoo.com

ABSTRACT
The self-organizing map (SOM) is an unsupervised neural network which projects high-dimensional data onto a low-dimensional grid and visually reveals the topological order of the original data. Thus, SOM is an excellent tool in the exploratory phase of data mining. Self-organizing maps have been successfully applied to many fields, including engineering and business domains. Experimental results on a census database illustrate the clustering results. The paper proposes to improve the performance of clustering through cloud computing. The approach focuses on Hadoop, which provides a Java-based software framework to distribute processing over a cluster of processors through an open-source implementation of MapReduce, a powerful tool designed for the detailed analysis and transformation of very large data sets.

Categories and Subject Descriptors
H.2.8 [Data Mining]

General Terms
Algorithms, Measurement, Performance, Design, Reliability, Experimentation, Standardization, Languages, Theory.

Keywords
Data Mining, Cluster Analysis, Self-Organizing Maps, Cloud Computing, Hadoop, Virtualization, MapReduce.

1. INTRODUCTION
evaluation, and knowledge deployment. Various tools with prominent visualization properties are used for the data survey step in data mining [7]. Data mining can be classified into two categories: descriptive data mining and predictive data mining. The former describes the data set in a concise and summarized manner and presents interesting general properties of the data. Cluster analysis is a part of descriptive data mining [1].
Cluster analysis has been widely used in numerous applications, including pattern recognition, data analysis, image processing and market research. By clustering, one can identify dense and sparse regions and, therefore, discover overall distribution patterns and interesting correlations among data attributes. In machine learning, clustering is an example of unsupervised learning. For this reason, clustering is a form of learning by observation rather than learning by examples.

In conceptual clustering, a group of objects forms a class only if it is describable by a concept. This differs from conventional clustering, which measures similarity based on geometric distance. Conceptual clustering consists of two parts: (a) it discovers the appropriate classes, and (b) it forms descriptions for each class, as in classification [9].

Kohonen’s self-organizing map (SOM) is an unsupervised neural network which projects high-dimensional data onto a low-dimensional grid. The projected data preserves the topological relationship of the original data. Hence, this ordered grid can be used as a convenient visualization surface for showing various features of the training data, for example, cluster structures [6]. The SOM is especially suitable for the data survey step in data mining as it has prominent visualization properties [7].

The conventional SOM training algorithm handles only numeric data, since the distance computation used to form clusters is based on the Euclidean distance. It is therefore unable to process categorical data. For example, for student data in a campus database, the department attribute is categorical. For sales transactions in a sales database, the product attribute is categorical while the sales-amount attribute is numeric. By generalizing the SOM model to the Generalized Self-Organizing Map (GSOM), categorical data from various applications can be handled effectively for data mining.
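To make the role of the Euclidean distance concrete, the following is a minimal sketch of one conventional SOM training step, written in plain Python. It is an illustration under stated assumptions, not the implementation used in this paper: the function names (`best_matching_unit`, `train_step`), the dictionary-based grid, and the Gaussian neighbourhood function are all choices made for this example. The sketch shows why every attribute has to be numeric: each one is subtracted and squared.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_unit(weights, sample):
    """Return the grid position (row, col) whose weight vector is closest to the sample.

    `weights` maps grid positions to weight vectors (an assumed layout).
    """
    return min(weights, key=lambda pos: euclidean(weights[pos], sample))


def train_step(weights, sample, learning_rate, radius):
    """Move the best-matching unit and its grid neighbours toward the sample.

    A Gaussian neighbourhood is assumed here; other kernels are possible.
    """
    bmu = best_matching_unit(weights, sample)
    for pos, w in weights.items():
        # Distance on the low-dimensional grid, not in the data space.
        grid_dist_sq = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
        weights[pos] = [wi + learning_rate * influence * (si - wi)
                        for wi, si in zip(w, sample)]
    return bmu
```

A value such as a department name cannot be passed through `euclidean` in this form, because it cannot be subtracted from a weight. This is the limitation of the conventional algorithm described above.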
Thus, the generalized SOM can handle categorical and mixed data, so that it can process more diverse data and its applicability expands. The applications of SOM include image processing, process monitoring and control, speech recognition, flaw detection in machinery, business and management, information retrieval, medical diagnosis [4], time-series prediction, optimization, and financial forecasting and management [3].

The performance of clustering can be improved through cloud computing. The approach focuses on Hadoop, which provides a Java-based software framework to distribute processing over a cluster of processors through an open-source implementation of MapReduce, a powerful tool designed for the detailed analysis and transformation of very large data sets.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICWET’11, February 25–26, 2011, Mumbai, Maharashtra, India. Copyright © 2011 ACM 978-1-4503-0449-8/11/02…$10.00.
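The MapReduce idea mentioned in the introduction can be illustrated with a small sketch. The code below is plain Python that imitates the map, shuffle, and reduce phases in a single process. It does not use the Hadoop Java API, and the particular decomposition (map each record to its nearest unit, then average per unit) is an assumption made for illustration rather than the method evaluated in this paper. All function names are hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(record, units):
    """Map: emit (index of the nearest unit, record) for one input record."""
    nearest = min(range(len(units)),
                  key=lambda i: squared_distance(units[i], record))
    return nearest, record


def shuffle(pairs):
    """Group mapped values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, records):
    """Reduce: average all records that were assigned to one unit."""
    n = len(records)
    return key, [sum(column) / n for column in zip(*records)]


def run_pass(records, units):
    """One full pass: map every record, group by unit, reduce each group."""
    mapped = [map_phase(r, units) for r in records]
    reduced = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    # Units that received no records keep their previous weights.
    return [reduced.get(i, units[i]) for i in range(len(units))]
```

Because each call to `map_phase` depends only on one record and the current units, the map work could in principle be split across many machines, with only the grouped results brought together in the reduce step. On a real cluster the framework, not the `shuffle` helper shown here, would perform the grouping.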