VOL. 7, NO. 9, SEPTEMBER 2012 ISSN 1819-6608
ARPN Journal of Engineering and Applied Sciences
© 2006-2012 Asian Research Publishing Network (ARPN). All rights reserved.
www.arpnjournals.com

AN INTEGRATED GHSOM-MLP WITH MODIFIED LM ALGORITHM FOR MIXED DATA CLUSTERING

D. Hari Prasad¹ and M. Punithavalli²
¹Department of Computer Applications, Sri Ramakrishna Institute of Technology, Coimbatore, India
²Department of Computer Applications, Sri Ramakrishna Engineering College, Coimbatore, India
E-Mail: hari.research@yahoo.com

ABSTRACT
Data clustering is a common approach to statistical data analysis and is used in several fields, including machine learning, data mining, customer requirement analysis, trend investigation, pattern identification and image analysis. Although many clustering approaches are available, most of them handle only numerical data. Clustering mixed data is more complicated and difficult because mixed data also contain nominal attributes. A large number of existing algorithms deal only with numeric data, a small number handle nominal data, and very few can handle both numeric and nominal values. There is therefore a significant need for efficient approaches to mixed data clustering. Existing mixed data clustering techniques take more time for clustering, and those based on the SOM are unable to capture the inherent hierarchical structure of the data. To overcome this, an integrated GHSOM-MLP with Modified LM Algorithm is proposed in this paper. The proposed technique is evaluated on the UCI Adult Data Set and compared with GHSOM in terms of the number of resultant clusters and the mean square error.
Keywords: mixed data clustering, growing hierarchical self-organizing map (GHSOM), modified LM algorithm, attribute-oriented induction, data mining.

INTRODUCTION
Clustering is one of the most popular data mining approaches and is suitable for numerous applications. The major reason for its wide range of application is its ability to work on datasets with little or no prior knowledge, which makes clustering convenient for many real-world applications. In recent times, high-dimensional data has attracted the attention of database researchers because of the significant challenges it poses to the research community. In a high-dimensional space, the distance from a record to its nearest neighbor can approach its distance to the farthest record [1]. In the context of clustering, this causes the distance between two records of the same cluster to approach the distance between two records of different clusters, so conventional clustering approaches may fail to recognize the correct clusters.

Clustering is the unsupervised classification of patterns into groups. It is an essential data analysis method that arranges a collection of patterns into clusters according to certain similarities [2-4]. Clustering is one of the significant techniques in numerous exploratory pattern-analysis, grouping, decision-making, and machine learning applications. Clustering techniques have been effectively exploited in a wide range of areas, including pattern recognition, biology, psychoanalysis, archaeology, geology, topography, marketing, image processing and information retrieval [5].

With the huge development in both computer hardware and software, an enormous amount of data is produced and gathered every day. These data can be used effectively only when the meaningful information hidden in them can be extracted.
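
The distance-concentration effect described in the introduction (a record's nearest-neighbor distance approaching its farthest-record distance as dimensionality grows) can be checked numerically. The following is a minimal sketch, not part of the proposed method: it assumes records drawn uniformly from the unit hypercube and Euclidean distance, and the function name `nearest_to_farthest_ratio` and its parameters are hypothetical names chosen for illustration.

```python
import math
import random


def nearest_to_farthest_ratio(n_records, n_dims, seed=0):
    """Return (nearest distance) / (farthest distance) from one query record.

    Assumption: all records are sampled uniformly from [0, 1]^n_dims and
    compared with Euclidean distance. A ratio close to 1 means the nearest
    and farthest records are almost equally far from the query.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_records):
        record = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, record))
    return min(dists) / max(dists)


if __name__ == "__main__":
    # Print the ratio for increasing dimensionality.
    for n_dims in (2, 10, 100, 500):
        ratio = nearest_to_farthest_ratio(200, n_dims)
        print(f"dims={n_dims:4d}  nearest/farthest={ratio:.3f}")
```

Under these assumptions the ratio tends to rise toward 1 as the number of dimensions increases, which is the behavior that makes within-cluster and between-cluster distances hard to tell apart.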
On the other hand, the main difficulty in acquiring the best information from data stems from the limitations of the data itself [6]. These difficulties come from the enormous size and versatile domains of the gathered data. Consequently, data mining, which aims to discover interesting patterns in huge collections of data within limited resources (i.e., computer memory and execution time), has become popular in recent years.

Clustering is a significant area of research for both data analysis and machine learning applications. New difficulties emerge continuously with the growth in different kinds of data, and new techniques have to be developed to handle large amounts of data that are heterogeneous in nature (numerical, symbolic, spatial, etc.). Several approaches have been developed to arrange, summarize or assemble a variety of data into a set of clusters in such a way that data belonging to the same cluster are similar and data from different clusters are dissimilar [7, 8]. However, most conventional clustering approaches focus either on numeric data or on categorical data [9]. Real-world datasets typically contain both numeric and categorical attributes, and it is difficult to apply conventional clustering approaches directly to such mixed data. In practice, a common technique for clustering databases with nominal attributes (columns) is to transform them into numeric attributes and then apply a numeric clustering technique. This is typically carried out by “exploding” the nominal attribute into a collection of new binary numeric