© 2018 JETIR August 2018, Volume 5, Issue 8 (ISSN-2349-5162), www.jetir.org, paper JETIRC006413

DATA WRANGLING AND DATA LEAKAGE IN MACHINE LEARNING FOR HEALTHCARE

1 Saravanan N, 2 Sathish G, 3 Balajee J M
1,2 Assistant Professor, Department of Computer Application, Priyadarshini Engineering College, Vaniyambadi, Vellore, Tamilnadu, India
3 Research Scholar, School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, Tamilnadu, India

Abstract: Healthcare and the life sciences now generate massive amounts of real-time data through enterprise resource planning (ERP) systems. Handling such volumes of data is difficult, and as the threat of data leakage by insiders grows, firms are deploying safeguards such as Data Loss Prevention (DLP) and Digital Rights Management (DRM). At the same time, leakage channels themselves have become more varied and harder to block. Machine learning techniques help manage this data by learning algorithms and rule sets that deliver the required results to workers. Deep learning adds automatic feature extraction, capturing the essential features needed to solve a problem and sparing workers from selecting features explicitly in supervised, unsupervised and semi-supervised tasks on healthcare data.

IndexTerms: Data leakage, Machine Learning, Deep Learning, Healthcare, Enterprise resource planning

Introduction

Deep learning and machine learning play a vital role in today's ERP (Enterprise Resource Planning) systems. When building an analytical model with deep learning or machine learning, the data set is collected from various sources such as files, databases, sensors and more [1]. The received data cannot be used directly for the analysis process.
To solve this problem, data preparation is carried out using two techniques: data preprocessing and data wrangling [2]. Data preparation is an essential part of data science. It consists of two notions, data cleaning and feature engineering, both of which are unavoidable for achieving good accuracy and performance in machine learning and deep learning tasks [3].

Data preprocessing is the procedure used to transform raw data into a clean data set. Data collected from different sources arrives in a raw format that is not viable for analysis [4], so specific phases are executed to convert it into a small, clean dataset. This technique is applied before iterative analysis begins. The set of steps, collectively known as data preprocessing, includes data cleaning, data integration, data transformation and data reduction.

Data wrangling is a technique applied while building an interactive model. In other words, it converts raw data into a format convenient for consumption, and is also known as data munging. It too follows specific steps: after extracting the data from different sources, the data is sorted using a suitable algorithm, decomposed into a separate structured format, and finally stored in another database [5].

To achieve good results from a machine learning or deep learning model, the data must be in the proper format. Some models need data in a specific form: for example, the Random Forest algorithm does not support null values, so null values must be handled in the original raw data set before the algorithm can be executed [6].
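As a minimal sketch of the null-handling step described above, the snippet below imputes missing numeric values with column medians using pandas. The patient-vitals table and its column names are hypothetical, not drawn from any particular healthcare system; the point is that an estimator such as scikit-learn's RandomForestClassifier rejects NaN inputs, so imputation must happen first.

```python
import pandas as pd

# Hypothetical patient-vitals table with missing entries; in practice
# this would come from an ERP export or a database query.
raw = pd.DataFrame({
    "age":        [63, 71, None, 54],
    "heart_rate": [88, None, 76, 92],
    "diabetic":   [1, 0, 0, 1],
})

# Random Forest implementations such as scikit-learn's reject NaN
# inputs, so impute each numeric column with its median before fitting.
clean = raw.fillna(raw.median(numeric_only=True))

assert not clean.isna().any().any()  # no missing values remain
```

Median imputation is only one choice; mean imputation, a constant fill, or dropping incomplete rows are equally valid depending on how much data can be sacrificed.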
Another aspect is that the dataset should be formatted so that more than one machine learning or deep learning algorithm can be executed on the same dataset and the best performer chosen. Data wrangling is therefore an important step in implementing a model: the data is converted into a feasible format before any model is applied to it [7]. Filtering, grouping and selecting the appropriate data can increase the accuracy and performance of the model. A further concern arises with time-series data, which every algorithm handles differently; data wrangling is used to convert the time series into the format required by the chosen model [8]. In simple words, complex data is converted into a usable format so that analysis can be performed on it.

Need of data preprocessing and data wrangling

To achieve good results from a machine learning or deep learning model, the data format must be proper. Some models need data in a specific form: the Random Forest algorithm, for instance, does not support null values, so these must be managed in the original raw data set before the algorithm is executed [9]. An additional requirement is that the dataset be formatted so that several machine learning and deep learning algorithms can be run on the one dataset and the best of them chosen.
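The time-series conversion mentioned above can be sketched in a few lines of pandas: irregular raw readings are sorted, indexed by time, and resampled onto the regular grid most forecasting models expect. The glucose readings, timestamps and column names here are purely illustrative.

```python
import pandas as pd

# Hypothetical stream of glucose readings taken at irregular times.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-08-01 08:15", "2018-08-01 20:40",
        "2018-08-02 07:55", "2018-08-03 09:10",
    ]),
    "glucose": [110, 145, 98, 120],
})

# Sort, index by time, and resample to one averaged row per day --
# a regular frequency that time-series models can consume directly.
daily = (readings.sort_values("timestamp")
                 .set_index("timestamp")
                 .resample("D")["glucose"]
                 .mean())

print(daily.tolist())  # [127.5, 98.0, 120.0]
```

The choice of resampling frequency ("D" for daily here) and aggregation (mean, last reading, maximum) depends entirely on what the downstream model requires.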