3248 | International Journal of Current Engineering and Technology, Vol.4, No.5 (Oct 2014)
General Article
International Journal of Current Engineering and Technology
E-ISSN 2277-4106, P-ISSN 2347-5161
©2014 INPRESSCO®, All Rights Reserved
Available at http://inpressco.com/category/ijcet
A Comparative Study of Different Data Mining Algorithms
Shrey BavisiȦ*, Jash MehtaȦ and Lynette LopesȦ
ȦComputer Department, DJSCOE, Vile Parle (W), Mumbai – 400056, India
Accepted 02 Sept 2014, Available online 01 Oct 2014, Vol.4, No.5 (Oct 2014)
Abstract
Data mining is used extensively in many sectors today, including business, health, security and informatics. Data
mining algorithms have been applied successfully in marketing, retail and other sectors of industry. The aim of this
paper is to present the reader with data mining algorithms that have wide applications. The paper focuses on four
data mining algorithms: k-NN, the Naïve Bayes classifier, decision trees and C4.5. These four algorithms are
compared on the basis of their theory, their advantages and disadvantages, and their applications. After studying
the algorithms in detail, we conclude that the accuracy of these techniques depends on several characteristics,
such as the type of problem, the dataset and the performance metric.
Keywords: Data mining, k-NN, Naïve Bayes classifier, Decision Tree, C4.5, classification.
1. Introduction
Data mining is the process of exploring large volumes of
data, typically business-related data, which is often
referred to as big data. The process is performed to find
hidden patterns and relationships in the data. The overall
objective of data mining is to extract information from a
large data set and transform it into a comprehensible
structure for further use. Generally, data mining tasks
are of two types:
1) Descriptive data mining: The data set is summarized
in a concise manner that brings out interesting
properties of the data.
2) Predictive data mining: The goal is prediction, which
is the most common application of data mining. Here
the behavior of future data sets is predicted from the
existing data.
The process of data mining consists of two stages:
1) Exploration Stage: This stage consists of
pre-processing the data. Before data mining
algorithms are applied, the data sets must be
assembled from all disparate sources; in other words,
the data must be extracted from every source in such
a way that the disparity between them is eliminated.
A common source of such data is the data warehouse
of a company, hospital, retail chain, etc. In this stage
the data is also cleansed and transformed so that
noise and missing values are dealt with.
2) Data mining Stage: This stage takes place after
performing exploration.
*Corresponding author: Shrey Bavisi
Class description: Class description provides a
concise summarization of the data. This is also called
characterization of data.
Association: Association is the discovery of
dependencies or correlations in the data sets. An
association rule expressed as X=>Y means that
database tuples that satisfy X are likely to satisfy Y.
Classification: Classification analyses a set of training
data and, based on the features of that data, generates
classification rules and constructs models that can
later be used to classify test data.
Clustering: Cluster analysis groups similar data in the
data sets. Similarity can be expressed in terms of
distance functions.
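To make the association rule X=>Y described above concrete, the following is a minimal sketch of how "likely to satisfy Y" is commonly quantified, using the standard support and confidence measures. The function names and the toy transactions are hypothetical and chosen only for illustration; they are not taken from this paper.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Estimate of P(Y | X) for the rule X => Y: support(X and Y) / support(X)."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0  # X never occurs, so the rule cannot be evaluated
    return support(transactions, set(x) | set(y)) / support_x


# Hypothetical toy data: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Confidence of the rule {bread} => {milk}
rule_conf = confidence(transactions, {"bread"}, {"milk"})
print(round(rule_conf, 3))
```

A rule is usually reported only when both its support and its confidence exceed thresholds chosen by the analyst.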
A large number of data mining algorithms are used in
fields such as Engineering, Meteorology, Informatics,
Corporate Business, Sales Forecasting, Business
Forecasting, Neurophysiology, Finance, Medicine and
many more. In this paper, however, we focus on four
commonly used data mining algorithms:
1) k-NN (k-Nearest Neighbours): k-NN is a simple
classification and regression algorithm.
2) Naïve Bayes classifier: The Naïve Bayes classifier is
a supervised learning algorithm that classifies data
using a statistical method.
3) Decision trees: Decision trees are powerful and
popular tools for classification and prediction.
4) C4.5: C4.5 is an algorithm developed by Ross
Quinlan. It generates decision trees that can then be
used for classification problems.
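As an illustration of the first algorithm in the list above, the following is a minimal sketch of k-NN used for classification. It assumes numeric feature vectors, Euclidean distance and a simple majority vote; the function names and the toy training data are hypothetical and not taken from this paper.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest neighbours.

    `train` is a list of (feature_vector, label) pairs.
    """
    if k < 1 or k > len(train):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort the training points by distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set with two well-separated classes.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

prediction = knn_predict(train, (1.1, 1.0), k=3)
print(prediction)
```

For regression, the majority vote would be replaced by, for example, the mean of the neighbours' target values; the choice of k and of the distance function is left to the user.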