Indonesian Journal of Electrical Engineering and Computer Science
Vol. 10, No. 3, June 2018, pp. 1234~1243
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v10.i3.pp1234-1243
Journal homepage: http://iaescore.com/journals/index.php/ijeecs

A Survey on Cleaning Dirty Data Using Machine Learning Paradigm for Big Data Analytics

Jesmeen M. Z. H. 1, J. Hossen 2, S. Sayeed 3, C. K. Ho 4, Tawsif K. 5, Armanur Rahman 6, E. M. H. Arif 7
1,2,5,6,7 Faculty of Engineering and Technology, Multimedia University, Melaka, 75450, Malaysia
3 Faculty of Information Science & Technology, Multimedia University, Melaka, 75450, Malaysia
4 Faculty of Computing and Informatics, Multimedia University, Melaka, 75450, Malaysia

Article Info
Article history: Received Jan 15, 2018; Revised Mar 11, 2018; Accepted Mar 24, 2018

ABSTRACT
Big data has recently become an important factor in business, and organizations need strategies to manage large volumes of structured, unstructured and semi-structured data. It is challenging to analyze data at this scale, to extract meaning from it and to handle uncertain outcomes. Almost all big data sets are dirty, i.e. they may contain inaccuracies, missing data, miscoding and other issues that weaken big data analytics. One of the biggest challenges in big data analytics is to discover and repair dirty data; failure to do so can lead to inaccurate analytics and unpredictable conclusions. Data cleaning is an essential part of managing and analyzing data. This survey first examines the data quality problems that may occur in big data processing, to make clear why an organization requires data cleaning, and then the data quality criteria (the dimensions used to indicate data quality). The cleaning tools available in the market are then summarized, and the challenges faced in cleaning big data due to the nature of the data are discussed.
Machine learning algorithms can be used to analyze data, make predictions and, finally, clean data automatically.

Keywords: Big data, Big data analytics, Data cleaning, Dirty data, Machine learning

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Jesmeen M. Z. H. & Dr. Jakir Hossen
Faculty of Engineering and Technology, Multimedia University, Melaka, 75450, Malaysia.
Email: jesmeen.online@gmail.com, jakir.hossen@mmu.edu.my

1. INTRODUCTION
In 2016, IBM estimated that around 2.5 quintillion bytes of data were being produced each day, and that 90% of all existing data had been created in the preceding two years alone [1]. Such data is typically generated by devices such as sensors and by newly emerging technologies, and the rate of data growth is likely to accelerate further. Cisco, meanwhile, forecast that by 2020 the volume of worldwide traffic crossing the Internet and IP WAN networks may reach 2.3 ZB per year [2]. The volume and heterogeneity of big data call for investigation using big data analytics. Big data analytics helps to discover hidden patterns, unknown relationships, current market trends, consumer preferences and other aspects of data that can help institutions and companies make up-to-date, faster and better business decisions. By now, most well-known companies have realized the need to implement big data analytics in their systems to deliver better products and services. Using big data capabilities, any company can improve its products and services and increase productivity by obtaining meaningful insights that move its work forward. Various tools are available in the market to handle big data, but these tools raise a few concerns [3]. They are not usually integrated with data quality management; therefore, in the market the