International Journal of Computer Applications Technology and Research, Volume 2, Issue 3, 214 - 217, 2013, www.ijcat.com

A HYBRID MODEL FOR MINING MULTI-DIMENSIONAL DATA SETS

S. Santhosh Kumar, PRIST University, Thanjavur, Tamil Nadu, India
E. Ramaraj, School of Computing, Alagappa University, Karaikudi, Tamil Nadu, India

Abstract: This paper presents a hybrid data mining approach, based on supervised and unsupervised learning, to identify the closest data patterns in a database. The technique achieves a high accuracy rate with minimal complexity. The proposed algorithm is compared with traditional clustering and classification algorithms and is also implemented on multidimensional datasets. The implementation results show better prediction accuracy and reliability.

Keywords: Classification, Clustering, C4.5, k-means Algorithm.

1. INTRODUCTION
Clustering and classification are two familiar data mining techniques used for grouping similar and dissimilar objects respectively. For the efficient use of data mining, clustering and classification are often applied as pre-processing activities. The clustering step categorizes the data: it reduces the number of features, removes irrelevant, redundant, or noisy data, and forms sub-groups of the given data based on similarity. Classification is then used as a secondary process that further divides the similar (clustered) groups into two discrete sub-groups based on an attribute value. Our research work shows that, especially for large databases, this prior grouping is required to reduce the many features of the data so that it can be mined easily. In this paper we propose a combined approach of classification and clustering for gene sub-type prediction.

2. PRELIMINARIES
2.1 C4.5 Algorithm
C4.5 is an algorithm used to generate a decision tree. It was developed by Ross Quinlan and is an extension of his ID3 algorithm.
The decision trees generated by C4.5 can be used for classification. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set of already-classified samples. Each sample is a p-dimensional vector whose components represent the attributes or features of the sample, together with the class into which the sample falls. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain: the attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sub-lists.

2.2 K-means Algorithm
The k-means algorithm was developed by MacQueen based on the standard algorithm. It is one of the most widely used hard clustering techniques. It is an iterative method in which the number of clusters must be specified beforehand. Given a set of observations (x_1, x_2, …, x_n), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S_1, S_2, …, S_k}, so as to minimize the within-cluster sum of squares:

    argmin_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ||x − μ_i||²

where μ_i is the mean of the points in S_i. The algorithm works as follows:
o Initialise k, the number of clusters
o Randomly select k cluster centres (centroids) in the data space
o Assign data points to clusters based on the shortest Euclidean distance to the cluster centres
o Re-compute new cluster centres by averaging the observations assigned to each cluster
o Repeat the previous two steps until the convergence criterion is satisfied
The advantages of this approach are its efficiency in handling large data sets and its ability to work with compact clusters.
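The steps above can be sketched in Python as follows. This is a minimal illustration of the listed procedure, not the authors' implementation; the maximum iteration count and seeded random initialisation are assumptions added for reproducibility.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means over d-dimensional points (lists of floats)."""
    rng = random.Random(seed)
    # Randomly select k cluster centres (centroids) from the data space
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Assign each point to the cluster with the nearest centroid
        # (shortest Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Re-compute centroids by averaging the observations in each cluster
        new_centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # convergence: centroids stable
            break
        centroids = new_centroids
    return centroids, clusters
```

The loop stops when a pass leaves the centroids unchanged, which is one common convergence criterion; a tolerance on centroid movement would serve equally well.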
The major limitations of k-means are the need to specify the number of clusters beforehand and its assumption that clusters are spherical.

3. RELATED STUDIES
In 2009, Juanying Xie, Chunxia Wang, Yan Zhang, and Shuai Jiang proposed CSVM [1], a clustering-based classification technique for unlabelled data prediction. They combined different variants of the k-means algorithm with an SVM classifier to achieve better results; CSVM was proposed in order to avoid the major drawback of the k-means algorithm, k-value initialisation.

In 2010 [2], Pritha Mahata proposed a new hierarchical clustering technique called ECHC (exploratory consensus of hierarchical clusterings), which is used to sub-group the various types of melanoma cancer. This work reveals that the k-means algorithm gives better results for biological subtypes with a proper sub-tree.

In 2010 [3], Taysir Hassan A. Soliman proposed clustering and classification as a combined approach to classify different types of diseases based on a gene selection method. The results showed improved accuracy in data prediction.

In 2011 [4], Reuben Evans, Bernhard Pfahringer, and Geoffrey Holmes proposed a statistics-based clustering technique for large datasets. They used the k-means algorithm as an initial step for centroid prediction and classification as a secondary step.

In March 2013 [5], Claudio Gentile, Fabio Vitale, and Giovanni Zappella implemented the combined technique (clustering and classification) in networks using signed graphs.
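The studies above all pair a clustering stage with a classification stage, and Section 2.1 names normalized information gain as C4.5's splitting criterion. The sketch below illustrates that criterion for a single numeric attribute; it is illustrative only, computing plain information gain while omitting C4.5's gain-ratio normalization and tree recursion, and the toy data in the usage example are assumptions.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels (C4.5's impurity measure)."""
    n = len(labels)
    return -sum((labels.count(l) / n) * math.log2(labels.count(l) / n)
                for l in set(labels))

def best_split(values, labels):
    """Pick the threshold on one numeric attribute with the highest
    information gain, as a C4.5-style decision node would."""
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:  # skip degenerate splits
            continue
        gain = base - (len(left) / len(labels)) * entropy(left) \
                    - (len(right) / len(labels)) * entropy(right)
        # (C4.5 would further divide this gain by the split's intrinsic
        # information to obtain the gain ratio)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

On a perfectly separable attribute, for example values [1, 2, 3, 10, 11, 12] with labels a, a, a, b, b, b, the chosen threshold recovers the class boundary and the gain equals the parent entropy.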