Council for Innovative Research
International Journal of Computers & Technology
www.cirworld.com, www.ijctonline.com
Volume 4 No. 1, Jan-Feb, 2013, ISSN 2277-3061

A CLUSTER ANALYSIS AND DECISION TREE HYBRID APPROACH IN DATA MINING TO DESCRIBING TAX AUDIT

Richa Dhiman¹
Department of Computer Science and Engineering, Lovely Professional University, Phagwara
Richadhiman58@gmail.com

Sheveta Vashisht²
Department of Computer Science and Engineering, Lovely Professional University, Phagwara
sheveta.16856@lpu.co.in

Kapil Sharma³
Department of Computer Science and Engineering, Lovely Professional University, Phagwara
kapilsharma701@gmail.com

ABSTRACT

In this research, we use clustering and classification methods to mine tax data and extract information about tax audits. We apply hybrid algorithms that combine the K-means, SOM and HAC clustering algorithms with the CHAID and C4.5 decision tree algorithms, and we compare them with the traditional algorithms on a tax dataset, where the hybrid produces better results. The clustering method is used to form clusters of similar records so that their features or properties can be extracted easily, and the decision tree method is used to choose the optimal decision for extracting valuable information from samples of the tax dataset. The approach is able to find clusters in large, high-dimensional spaces efficiently. It is suitable for clustering in the full dimensional space as well as in subspaces. Experiments on both synthetic data and real-life data show that the technique is effective and also scales well for large, high-dimensional datasets.

Keywords- Clustering, Decision tree, HAC, SOM, C4.5.

I. INTRODUCTION

Data mining is an important step of the knowledge discovery process, in which knowledge is discovered in a data set. It provides useful patterns or models for discovering important and useful data in the whole database. We used different algorithms to extract the valuable data.
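The hybrid idea described in the abstract (cluster first, then let a decision tree describe the clusters) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the six records are invented toy values (declared income, deductions), the k-means initialisation is a simple farthest-first heuristic, and `best_split` finds only a single C4.5-style gain-ratio split rather than growing a full tree. All function names are hypothetical.

```python
import math
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    # Assumption: deterministic farthest-first initialisation.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(points, labels):
    """Return (feature index, threshold, gain ratio) of the best binary split."""
    base = entropy(labels)
    best = (None, None, 0.0)
    for f in range(len(points[0])):
        values = sorted({p[f] for p in points})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [l for p, l in zip(points, labels) if p[f] <= t]
            right = [l for p, l in zip(points, labels) if p[f] > t]
            w = len(left) / len(labels)
            gain = base - w * entropy(left) - (1 - w) * entropy(right)
            split_info = entropy(["L"] * len(left) + ["R"] * len(right))
            ratio = gain / split_info
            if ratio > best[2]:
                best = (f, t, ratio)
    return best


# Hypothetical toy records: (declared income, deductions).
records = [(20, 2), (22, 3), (25, 2), (80, 30), (85, 28), (90, 35)]
clusters = kmeans(records, k=2)                            # step 1: clustering
feature, threshold, ratio = best_split(records, clusters)  # step 2: describe clusters
print(clusters)
print(feature, threshold, round(ratio, 3))
```

On this toy data the two clusters are separated by a single threshold on the first attribute, so the split reads as a rule describing each cluster. A real study would replace the toy records with the tax dataset and the single split with a full CHAID or C4.5 tree.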
To mine the data we use these important steps or tasks [1]:
- Classification assigns data items to predefined classes and finds a model for analysis.
- Regression identifies real-valued variables.
- Clustering describes the data and groups similar objects into categories.
- Find the dependencies between variables.
- Mine the data using tools.

Clustering and decision trees are two of the most widely used methods of data mining, and they make researching data much more convenient. Cluster analysis groups objects based on the information found in the data describing the objects or their relationships. The goal is that the objects in a group will be similar to one another and different from the objects in other groups. The greater the similarity or homogeneity within a group and the greater the difference between groups, the "better" or more distinct the clustering. Clustering is a tool for data analysis, which solves classification problems. Its objective is to distribute cases into groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters. In this way each cluster describes, in terms of the data collected, the class to which its members belong.

Classification is an important task in data mining. It belongs to directed (supervised) learning, and the main methods include decision trees, neural networks and genetic algorithms. A decision tree builds its optimal tree model by selecting important association features. Selecting the test attribute and partitioning the sample sets are the two parts of building a tree. Different decision tree methods adopt different techniques to settle these problems. Algorithms include ID3, C4.5, CART and SPRINT.

II. BACKGROUND

Ji Dan et al. (2010) presented a new synthesized data mining algorithm [3] named CA, which improves the original methods of CURE and C4.5.
CA introduces principal component analysis (PCA) [2], grid partition and parallel processing, which can achieve feature reduction and scale reduction for large-scale datasets. Their paper applies the CA algorithm to maize seed breeding, and the experimental results show that the approach is better than the original methods. They introduce feature reduction, scale reduction and classification analysis to handle large, high-dimensional datasets. By applying the CA algorithm to maize seed breeding, they find the important features that strongly influence breeding and obtain a classification model of the whole set of maize samples. They conclude that the efficiency of CA is higher not only in clustering but also in the decision tree. They also note some limitations: CA is sensitive to parameters such as the number of clusters, the shrink factor and the threshold; C4.5 can only deal with datasets that have a classification feature; and the dataset they treated is rather small, which affects the final output of the algorithms.

Guojun Mao et al. (2011) study micro-cluster based [3] classification problems in distributed data streams and propose an approach for mining data streams in distributed environments with both labelled and unlabelled data. For each local site, a local micro-cluster based ensemble is used and its updating algorithms are designed. Making use of time-based sliding window techniques, the local models in a fixed time-span are transferred to a central site after being generated in all