Impact of Data Pruning on Machine Learning Algorithm Performance

Debrup Chakraborty, School of Computer Science & Statistics, Trinity College Dublin, Dublin, Ireland, chakrabd@tcd.ie
Viren Chhabria, School of Computer Science & Statistics, Trinity College Dublin, Dublin, Ireland, chhabriv@tcd.ie
Aneek Barman Roy, School of Computer Science & Statistics, Trinity College Dublin, Dublin, Ireland, barmanra@tcd.ie
Arun Thundyill Saseendran, School of Computer Science & Statistics, Trinity College Dublin, Dublin, Ireland, thundyia@tcd.ie
Lovish Setia, School of Computer Science & Statistics, Trinity College Dublin, Dublin, Ireland, setial@tcd.ie

Abstract: Dataset pruning is the process of removing sub-optimal tuples from a dataset to improve the learning of a machine learning model. In this paper, we compared the performance of different algorithms, first on an unpruned dataset and then on an iteratively pruned dataset. The goal was to understand whether, if an algorithm (say A) performs better than another algorithm (say B) on an unpruned dataset, algorithm B will perform better on the pruned data, or vice versa. The dataset chosen for our analysis is a subset of the largest movie ratings database publicly available on the internet, IMDb [1]. The learning objective of the model was to predict the categorical rating of a movie among 5 bins: poor, average, good, very good, excellent. The results indicated that an algorithm that performed better on an unpruned dataset also performed better on a pruned dataset.

Keywords: movie rating, IMDb, data pruning

1 INTRODUCTION

A fine line separates cleaning and pruning of a dataset. Cleaning is mostly a preprocessing step that involves removing unneeded data, imputing missing data, standardizing or normalizing the feature ranges, and converting categorical values to numbers [2] [3]. In comparison, pruning takes place after preprocessing, where certain data is strategically removed to improve the machine learning model.
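The cleaning steps listed above can be made concrete with a minimal sketch. It is not the authors' code: the toy records, the column names `budget` and `genre`, and the helper functions are hypothetical, and treating a missing categorical value as its own category is only one possible imputation choice. The sketch uses only the Python standard library; in practice a library such as scikit-learn would typically handle these steps.

```python
from statistics import mean, pstdev

# Toy records; the column names are hypothetical, not the paper's schema.
rows = [
    {"budget": 100.0, "genre": "Drama"},
    {"budget": None, "genre": "Comedy"},
    {"budget": 300.0, "genre": None},
    {"budget": 200.0, "genre": "Drama"},
]


def impute_numeric(rows, key):
    """Replace missing numeric values with the mean of the observed ones."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = mean(observed)
    return [fill if r[key] is None else r[key] for r in rows]


def encode_categorical(rows, key, missing_label="Missing"):
    """Map each category to an integer code; missing values form their own category."""
    values = [missing_label if r[key] is None else r[key] for r in rows]
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


budget = impute_numeric(rows, "budget")                    # imputation
genre_codes, genre_map = encode_categorical(rows, "genre")  # categorical -> numbers
budget_scaled = standardize(budget)                        # standardization
```

None of these steps removes tuples based on their effect on the model; that strategic removal is what distinguishes pruning from cleaning.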
In this paper, we examine the effect of dataset pruning on the performance of different machine learning algorithms, i.e., if an algorithm (say A) performs better than another algorithm (say B) on an unpruned dataset, will algorithm B perform better on the pruned data, or vice versa?

2 RELATED WORK

Data pruning was defined in 2005 as an automated process of noise cleaning, and the performance of this mechanism was measured using SVC and AdaBoost algorithms [4]. That work found the removal of certain portions of a dataset to be worthwhile and to affect the performance of machine learning algorithms [4]. A mathematical model was proposed to predict the success of upcoming movies based on the correlation of factors affecting the success of a movie [5]. Automatic rating prediction was proposed in 2011 using the IMDb dataset; however, the results were inferior to the baseline, which was attributed to the dataset lacking diversity in terms of user rating [6].

3 METHODOLOGY

3.1 Dataset

The dataset chosen is from the largest publicly available movie rating database, IMDb [1]. It contains 5,043 movies with 28 attributes, with the IMDb score indicating the movie rating on a scale of 1-10. The histogram in Figure 1 shows the frequency distribution of the IMDb score; ratings between 6 and 7 are the most frequent.

Figure 1: Frequency of IMDb Score of raw dataset

3.2 Pre-processing

IMDb ratings have continuous values in the range 1-10. The ratings were categorized into 5 classes (poor, average, good, very good, excellent) based on the bin edges [0, 7, 8, 8.5, 9, 10]. Missing numeric data was imputed with the mean of the available values, while missing categorical data was assigned to a separate “Missing” category. Duplicate tuples were removed. Categorical data was transformed to numbers using LabelEncoder and OneHotEncoder. The feature data was standardized using StandardScaler. The