Data Preparation

Zahraa S. Abdallah, Monash University
Lan Du, Monash University
Geoffrey I. Webb, Monash University

Synonyms
Data preprocessing, Data wrangling

1. Summary

Before data can be analyzed they must be organized into an appropriate form. Data preparation is the process of manipulating and organizing data prior to analysis. It is typically an iterative process of manipulating raw data, which are often unstructured and messy, into a more structured and useful form that is ready for further analysis. The whole preparation process consists of a series of major activities (or tasks), including data profiling, cleansing, integration, and transformation.

2. Motivation and Background

Data are collected for many purposes, not necessarily with machine learning or data mining in mind. Consequently, there is often a need to identify and extract the data relevant to the given analytic purpose. Every learning system has specific requirements about how data must be presented for analysis, and hence data must be transformed to fulfill those requirements. Further, the selection of the specific data to be analyzed can greatly affect the models that are learned. For these reasons, data preparation is a critical part of any machine learning exercise, and is often the most time-consuming part of any non-trivial machine learning or data mining project.

In most cases, the preparation process consists of dozens of transformations and needs to be repeated several times. Despite advances in technologies for working with data, each of those transformations may involve much handcrafted work and can consume a significant amount of time and effort. Thus, working with huge and diverse data remains a challenge.

It is widely agreed that data wrangling/preparation is the most tedious and time-consuming aspect of data analysis. It has become a major bottleneck or "iceberg" for performing advanced data analysis, particularly on big data.
A recent article in the New York Times [3] reported that the whole process of data wrangling could account for up to 80% of the time in the analysis cycle. In other words, only a small fraction of the time of data analysts and scientists is left for the analysis itself. According to the data science report [8], published by Crown in 2015, messy and disorganized data are the number one obstacle holding data scientists back. The same study reports that 70% of a data scientist’s time is spent cleaning data.

To be published in C Sammut and G I Webb (Eds) Encyclopedia of Machine Learning and Data Mining, Springer, 2017.