VOL. 7, NO. 9, SEPTEMBER 2012 ISSN 1819-6608
ARPN Journal of Engineering and Applied Sciences
© 2006-2012 Asian Research Publishing Network (ARPN). All rights reserved.
www.arpnjournals.com
AN INTEGRATED GHSOM-MLP WITH MODIFIED LM ALGORITHM
FOR MIXED DATA CLUSTERING
D. Hari Prasad¹ and M. Punithavalli²
¹Department of Computer Applications, Sri Ramakrishna Institute of Technology, Coimbatore, India
²Department of Computer Applications, Sri Ramakrishna Engineering College, Coimbatore, India
E-Mail: hari.research@yahoo.com
ABSTRACT
Data clustering is a common approach to statistical data analysis and is used in several fields, including
machine learning, data mining, customer requirement analysis, trend investigation, pattern identification and image
analysis. Although many clustering approaches are available, most of them handle only the clustering of numerical
data. Clustering mixed data is more complicated and difficult because mixed data contain nominal attributes. A large
number of existing algorithms deal only with numeric data, a small number handle nominal data, and very few can
handle both numeric and nominal values. There is therefore a significant need for efficient approaches to mixed data
clustering. Existing mixed data clustering techniques take more time for clustering, and the use of SOM cannot
capture the inherent hierarchical structure of data. To overcome this, an integrated GHSOM-MLP with Modified LM
Algorithm is proposed in this paper. The proposed technique is evaluated on the UCI Adult Data Set and compared with
GHSOM in terms of the number of resultant clusters and mean square error.
Keywords: mixed data clustering, growing hierarchical self-organizing map (GHSOM), modified LM algorithm, attribute-oriented
induction, data mining.
INTRODUCTION
Clustering is one of the most popular data mining
approaches and is suited to numerous applications. The
major reason for its wide range of application is that
clustering techniques can work on datasets with little or
no prior knowledge, which makes clustering convenient
for many real-world applications. In recent times,
high-dimensional data has attracted the attention of
database researchers because of the significant challenges
it poses to the research community. In a high-dimensional
space, the distance between a record and its nearest
neighbor can approach its distance to the farthest
record [1]. In the context of clustering, this causes the
distance between two records of the same cluster to
approach the distance between two records of different
clusters. Conventional clustering approaches may
therefore fail to identify the correct clusters.
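The distance effect described above can be illustrated with a small experiment. The following Python sketch is not part of the proposed method; it assumes points drawn uniformly at random from the unit hypercube, and the function name contrast_ratio and its parameters are hypothetical names chosen for this illustration. It computes the ratio of the nearest-neighbor distance to the farthest-neighbor distance for a random query point. Under these assumptions, the ratio tends to move toward 1 as the dimensionality grows, which is the behavior that makes distance-based clustering harder.

```python
import math
import random


def contrast_ratio(dim, n_points=200, seed=0):
    """Return (nearest distance) / (farthest distance) from a random
    query point to n_points uniform random points in [0, 1]^dim.

    Illustrative assumption: data are uniform and independent per
    dimension; real datasets may behave differently.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Euclidean distance between the query and the sampled point
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)


if __name__ == "__main__":
    # Print the ratio for increasing dimensionality
    for dim in (2, 10, 100, 1000):
        print(dim, round(contrast_ratio(dim), 3))
```

A ratio close to 0 means the nearest record is much closer than the farthest one; a ratio close to 1 means the two are almost indistinguishable by distance alone.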
Clustering is the unsupervised classification of
patterns into groups. It is an essential data analysis
method that arranges a collection of patterns into clusters
in accordance with certain similarities [2-4]. Clustering is
a significant technique in numerous exploratory
pattern-analysis, grouping, decision-making, and machine
learning applications. Clustering techniques have been
effectively exploited in a wide range of areas, including
pattern recognition, biology, psychoanalysis, archaeology,
geology, topography, marketing, image processing and
information retrieval [5].
With the rapid development of both computer
hardware and software, an enormous amount of data is
produced and gathered every day. These data can be used
effectively only when meaningful, hidden information can
be extracted from them. However, acquiring the best
information from data is considerably difficult owing to
the limitations of the data itself [6]. These difficulties
stem mainly from the enormous size and versatile domains
of the gathered data. Consequently, data mining, which
aims to discover interesting patterns in huge collections
of data within limited resources (i.e., computer memory
and execution time), has become popular in recent years.
Clustering is a significant area of research for
both data analysis and machine learning applications.
New difficulties emerge continuously as different kinds
of data grow, and new techniques have to be developed to
handle large amounts of data that are heterogeneous in
nature (numerical, symbolic, spatial, etc.). Several
approaches have been developed for arranging,
summarizing or assembling a variety of data into a set of
clusters, in such a way that data belonging to the same
cluster are similar and data from different clusters are
dissimilar [7, 8].
However, most conventional clustering
approaches are designed to focus either on numeric data
or on categorical data [9]. Real-world datasets typically
contain both numeric and categorical attributes, so it is
more complicated to apply conventional clustering
approaches directly to such mixed data.
In practice, a common technique for clustering
databases with nominal attributes (columns) is to
transform them into numeric elements and then apply a
numeric clustering technique to carry out the clustering
process. This is typically carried out by “exploding” the
nominal element into a collection of new binary numeric