Indonesian Journal of Electrical Engineering and Computer Science
Vol. 10, No. 3, June 2018, pp. 1234~1243
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v10.i3.pp1234-1243
Journal homepage: http://iaescore.com/journals/index.php/ijeecs
A Survey on Cleaning Dirty Data Using Machine Learning
Paradigm for Big Data Analytics
Jesmeen M. Z. H.¹, J. Hossen², S. Sayeed³, C. K. Ho⁴, Tawsif K.⁵, Armanur Rahman⁶, E. M. H. Arif⁷
¹,²,⁵,⁶,⁷ Faculty of Engineering and Technology, Multimedia University, Melaka, 75450, Malaysia
³ Faculty of Information Science & Technology, Multimedia University, Melaka, 75450, Malaysia
⁴ Faculty of Computing and Informatics, Multimedia University, Melaka, 75450, Malaysia
Article Info
Article history:
Received Jan 15, 2018
Revised Mar 11, 2018
Accepted Mar 24, 2018
ABSTRACT
Big Data has recently become one of the important new factors in business.
Organizations therefore need strategies to manage large volumes of
structured, unstructured and semi-structured data. Analyzing data at such a
scale to extract its meaning and to handle uncertain outcomes is challenging.
Almost all big data sets are dirty, i.e. they may contain inaccuracies,
missing data, miscoding and other issues that weaken big data analytics.
One of the biggest challenges in big data analytics is to discover and
repair dirty data; failure to do so can lead to inaccurate analytics and
unpredictable conclusions. Data cleaning is an essential part of managing
and analyzing data. This survey first examines the data quality problems
that may occur in big data processing, to clarify why an organization
requires data cleaning, followed by data quality criteria (the dimensions
used to indicate data quality). Then, the cleaning tools available in the
market are summarized, and the challenges faced in cleaning big data due to
the nature of the data are discussed. Finally, machine learning algorithms
can be used to analyze data, make predictions and ultimately clean data
automatically.
Keywords:
Big data
Big data analytics
Data cleaning
Dirty data
Machine learning
Copyright © 2018 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Jesmeen M. Z. H. & Dr. Jakir Hossen
Faculty of Engineering and Technology,
Multimedia University,
Melaka, 75450, Malaysia.
Email: jesmeen.online@gmail.com, jakir.hossen@mmu.edu.my
1. INTRODUCTION
In 2016, IBM estimated that around 2.5 quintillion bytes of data are produced each day and that 90% of
all existing data was created in the last two years alone [1]. This big data is usually generated by devices
such as sensors and by newly emerging technologies, and the rate of data growth will likely accelerate
further. Cisco, meanwhile, forecast that by 2020 the volume of worldwide traffic crossing the Internet and
IP WAN networks may reach 2.3 ZB per year [2].
The bulky and heterogeneous nature of big data requires investigation using big data analytics. Big
data analytics helps to discover hidden patterns, unknown relationships, current market trends,
consumer preferences and other aspects of data that can help institutes and companies make
up-to-date, faster and better business decisions.
By now, most well-known companies have realized the need to implement big data analytics in
their systems for better products and services. Using big data capabilities, any company can improve its
products and services and grow productivity by obtaining meaningful insights that move its work
forward. Different tools are available in the market to handle big data, but these tools come with a
few issues [3]. They are not usually integrated with data quality management; therefore, in the market the