Dhinaharan Nagamalai et al. (Eds) : SAI, NCO, SOFT, ICAITA, CDKP, CMC, SIGNAL - 2019
pp. 83-99, 2019. © CS & IT-CSCP 2019 DOI: 10.5121/csit.2019.90707
FACTORS AFFECTING CLASSIFICATION
ALGORITHMS RECOMMENDATION: A SURVEY
Mariam Moustafa Reda¹, Dr Mohammad Nassef² and Dr Akram Salah³
¹,²,³Computer Science Department, Faculty of Computers and Information,
Cairo University, Giza, Egypt
ABSTRACT
Many classification algorithms are available in data mining for solving the same kind of
problem, yet there is little guidance for recommending the algorithm that gives the best
results for the dataset at hand. To improve the chances of recommending the most appropriate
classification algorithm for a dataset, this paper focuses on the factors that data miners and
researchers have considered, across different studies, when selecting the classification
algorithms that will yield the desired knowledge from the dataset at hand. The paper divides
the factors affecting classification algorithm recommendation into business and technical
factors. The proposed technical factors are measurable and can be exploited by
recommendation software tools.
KEYWORDS
Classification, Algorithm selection, Factors, Meta-learning, Landmarking
1. INTRODUCTION
Business organizations store large amounts of raw data in their databases. With increasingly
competitive markets and growing computing capabilities, businesses face a massive volume of
stored data and the need to identify patterns, correlations, and predictive information that
business experts may miss. Data mining is the field that helps business experts make better
decisions based on the patterns and relationships discovered in the available data. One key
data mining task is classification, which addresses the problem of assigning each unit of
analysis in a dataset to a target class in order to support more accurate predictions. There are
different categories of classification algorithms, but any classification algorithm needs one or
more fields to be used as predictors and a target field to predict.
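
The roles of predictor fields and a target field can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the studies surveyed: the toy records, the field names `age`, `income` and `buys`, and the two deliberately simple classifiers (a majority-class baseline and a 1-nearest-neighbour rule).

```python
from collections import Counter

# Hypothetical toy dataset: each record (unit of analysis) has two
# predictor fields and one target field. All values are made up.
records = [
    {"age": 25, "income": 30, "buys": "no"},
    {"age": 30, "income": 42, "buys": "no"},
    {"age": 45, "income": 80, "buys": "yes"},
    {"age": 50, "income": 95, "buys": "yes"},
    {"age": 35, "income": 60, "buys": "yes"},
]
PREDICTORS = ("age", "income")  # fields used as predictors
TARGET = "buys"                 # field to predict


def majority_class(train):
    """Baseline classifier: always predict the most frequent target class."""
    most_common = Counter(r[TARGET] for r in train).most_common(1)[0][0]
    return lambda record: most_common


def nearest_neighbour(train):
    """1-nearest-neighbour classifier using squared Euclidean distance."""
    def predict(record):
        closest = min(
            train,
            key=lambda r: sum((r[p] - record[p]) ** 2 for p in PREDICTORS),
        )
        return closest[TARGET]
    return predict


# A new, unlabelled unit of analysis: predictors only, no target value.
new_customer = {"age": 28, "income": 35}
baseline = majority_class(records)
knn = nearest_neighbour(records)
print(baseline(new_customer), knn(new_customer))
```

On this toy data the two classifiers assign the same record to different classes, which hints at why the choice of algorithm matters for a given dataset.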
To stay on track in a data mining project, a standard methodology or a list of best practices
has to be followed. Efforts have been made to establish a standard data mining methodology
that guides the implementation of different data mining tasks [1]. The most popular
methodologies followed by researchers are CRISP-DM (Cross-Industry Standard Process for
Data Mining) and SEMMA (Sample, Explore, Modify, Model, and Assess). CRISP-DM was
developed under the European Strategic Program on Research in Information Technology
(ESPRIT), while SEMMA was developed by SAS Institute. Both methodologies have
well-defined phases for modelling the data with an algorithm and for evaluating the model once
it has been created. In addition, the earliest methodology, KDD (Knowledge Discovery in
Databases), was adopted by data scientists for years. During modelling, several algorithms
could be used to perform the same data mining task and still produce different results. For
example, to address a classification problem, one may choose from many algorithms. One
option is neural nets, which have many variants and are considered black-box models; other
options are C5.0 and CHAID, which are decision tree algorithms; last but not