International Journal of Current Engineering and Technology, Vol.4, No.5 (Oct 2014), p. 3248
General Article
E-ISSN 2277 4106, P-ISSN 2347 - 5161
©2014 INPRESSCO®, All Rights Reserved
Available at http://inpressco.com/category/ijcet

A Comparative Study of Different Data Mining Algorithms

Shrey Bavisi Ȧ*, Jash Mehta Ȧ and Lynette Lopes Ȧ
Ȧ Computer Department, DJSCOE, Vile Parle (W), Mumbai 400056, India

Accepted 02 Sept 2014, Available online 01 Oct 2014, Vol.4, No.5 (Oct 2014)

Abstract

Data mining is used extensively in many sectors today, viz., business, health, security, informatics etc. The successful application of data mining algorithms can be seen in marketing, retail, and other sectors of the industry. The aim of this paper is to present the reader with various data mining algorithms that have wide applications. The paper focuses on four data mining algorithms: k-NN, Naïve Bayes classifier, Decision tree and C4.5. An attempt has been made to compare these four algorithms on the basis of their theory, their advantages and disadvantages, and their applications. After studying these algorithms in detail, we conclude that the accuracy of these techniques depends on various characteristics such as the type of problem, the dataset and the performance metric.

Keywords: Data mining, k-NN, Naïve Bayes classifier, Decision Tree, C4.5, classification.

1. Introduction

Data mining is the process of exploring huge volumes of data, typically business-related data, also called big data. The process is performed to find hidden patterns and relationships present in the data. The overall objective of data mining is to extract information from a large data set and transform it into a comprehensible structure for further use.
*Corresponding author: Shrey Bavisi

Generally, data mining tasks are of two types:
1) Descriptive data mining: The data set is summarized in a concise manner, presenting interesting properties of the data.
2) Predictive data mining: The ultimate goal is prediction, the most common application of data mining, in which the behavior of future data sets is predicted.

The process of data mining consists of two stages:
1) Exploration Stage: This stage consists of pre-processing the data. Before data mining algorithms are applied, the data sets must be assembled from all disparate sources; in other words, the data must be extracted from all sources such that the disparity is eliminated. A common source of such data is the data warehouse of a company, hospital, retail chain etc. In this stage the data is cleansed and transformed so that noise and missing values are dealt with.
2) Data mining Stage: This stage takes place after exploration and covers tasks such as the following.

Class description: Class description provides a concise summarization of data. This is also called characterization of data.
Association: Association is the discovery of dependencies or correlations in the data sets. An association rule expressed as X=>Y means that database tuples that satisfy X are likely to satisfy Y.
Classification: Classification analyses a set of training data. Based on the features of the training data, classification rules are generated and models are constructed, which can later be used on test data.
Clustering: Clustering analysis groups similar data in the data sets. Similarity can be expressed in terms of distance functions.

A large number of data mining algorithms are used in fields such as Engineering, Meteorology, Informatics, Corporate Business, Sales Forecasting, Business Forecasting Domains, Neurophysiology, Finance, Medicine and many more.
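To make the phrase "tuples that satisfy X are likely to satisfy Y" concrete, the following minimal Python sketch estimates how often an association rule X=>Y holds in a small set of transactions. The names `support` and `confidence` follow the measures commonly used for association rules; they, and the toy transactions, are assumptions introduced here for illustration and do not come from the paper.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Estimate of P(Y | X) for the rule X => Y: support(X and Y) / support(X)."""
    support_x = support(transactions, x)
    if support_x == 0:
        raise ValueError("X never occurs, so the rule X => Y cannot be evaluated")
    return support(transactions, frozenset(x) | frozenset(y)) / support_x


# Hypothetical toy data: each transaction is the set of items in one tuple.
transactions = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk"}),
]

rule_confidence = confidence(transactions, {"bread"}, {"butter"})
print(f"confidence(bread => butter) = {rule_confidence:.2f}")
```

In this toy data, three of the four transactions contain bread and two of those also contain butter, so the rule bread=>butter holds in two thirds of the cases where its left-hand side is satisfied.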
In this paper, we focus mainly on four commonly used mining algorithms:
1) k-NN (k-Nearest Neighbours): k-NN is a simple classification and regression algorithm.
2) Naïve Bayes classifier: The Naïve Bayes classifier is a supervised learning algorithm that classifies data using a statistical method.
3) Decision trees: Decision trees are powerful and popular tools for classification and prediction.
4) C4.5: C4.5 is an algorithm developed by Ross Quinlan. It generates decision trees that can be used for classification problems.
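As an illustration of the first algorithm in the list, the sketch below shows one common way to classify a point with k-NN: find the k training points closest to the query and take a majority vote over their labels. The Euclidean distance, the function name `knn_predict`, the default k=3 and the toy training set are all assumptions made for this example rather than details given in the paper.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest points.

    `train` is a list of (features, label) pairs; distance is Euclidean
    (an assumption of this sketch, other distance functions also work).
    """
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort the training points by distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    # Count the labels of the k neighbours; ties are not treated specially here.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set with two well-separated groups.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

print(knn_predict(train, (1.1, 1.0)))  # query lies inside the "A" group
print(knn_predict(train, (5.1, 5.0)))  # query lies inside the "B" group
```

The sketch keeps all training data in memory and sorts it for every query, which is adequate for small data sets; the choice of k and of the distance function usually has to be tuned to the problem.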