Pre-Processing Methods for Imbalanced Data Set of Wilted Tree

Ahmet Murat Turk*1, Kemal Ozkan2
1 Department of Computer Engineering, Anadolu University, Eskisehir, Turkiye (E-mail: ahmetmuratturk@anadolu.edu.tr)
2 Department of Computer Engineering, Eskisehir Osmangazi University, Eskisehir, Turkiye (E-mail: kozkan@ogu.edu.tr)
Corresponding Author's e-mail: ahmetmuratturk@anadolu.edu.tr

ABSTRACT
Machine learning algorithms build a model from training data under the implicit assumption that the numbers of instances in the different classes are roughly equal. In real-world problems, however, data sets are usually imbalanced, and this can seriously degrade the resulting model. Research on imbalanced data sets has focused on over-sampling the minority class or under-sampling the majority class, and several recently proposed methods, such as modified support vector machines, rough-set-based minority-class-oriented rule learning, and cost-sensitive classifiers, perform well on imbalanced data sets. Although these methods artificially provide a balanced training set, in some real-world problems the type of error is critical, since the cost of a false negative error is higher than that of a false positive error. For instance, when classifying satellite images to detect diseased trees, most trees in a forest are naturally expected to be healthy, and a classification algorithm is effective only if this critical minority information is not lost. One reason trees in a forest become diseased is an insect epidemic. If the classification system fails to detect a wilted tree, not only will that tree dry out, but the insects carrying the disease may still spread it to other trees. Therefore, the main goal of this work is to minimize false negative errors. In this work, pre-processing methods for imbalanced data sets that steer classification results toward minimizing false negative errors are discussed.

Keywords: Imbalanced dataset, oversampling, undersampling

1.
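As a concrete illustration of the over-sampling strategy mentioned in the abstract, the sketch below shows plain random over-sampling in Python. The function name, labels, and toy data are illustrative, not taken from the paper: minority-class samples are duplicated at random until both classes are the same size.

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority-class samples until the
    class counts are equal (plain random over-sampling)."""
    rng = random.Random(seed)
    minority = [(x, lab) for x, lab in zip(X, y) if lab == minority_label]
    majority = [(x, lab) for x, lab in zip(X, y) if lab != minority_label]
    needed = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(needed)]
    balanced = majority + minority + extra
    X_new = [x for x, _ in balanced]
    y_new = [lab for _, lab in balanced]
    return X_new, y_new

# Toy data: 2 wilted trees among 8 healthy ones (features stubbed as ids).
X = list(range(10))
y = ["wilted"] * 2 + ["healthy"] * 8
X_bal, y_bal = random_oversample(X, y, "wilted")
# y_bal now holds 8 "wilted" and 8 "healthy" labels
```

Random over-sampling is the simplest of the pre-processing family discussed here; more elaborate variants synthesize new minority samples (e.g. SMOTE) instead of duplicating existing ones.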
INTRODUCTION
A data set is called imbalanced if it contains many more samples from one class than from the rest: at least one class (the minority class) is represented by only a small number of training examples, while the other classes make up the majority. On imbalanced data sets, classifiers can achieve good accuracy on the majority class but very poor accuracy on the minority class(es), because of the influence the larger majority class exerts on traditional training criteria [1]. Most standard classification algorithms seek to minimize the error rate: the percentage of incorrectly predicted class labels. They ignore the differences between types of misclassification errors; in particular, they implicitly assume that all misclassification errors cost the same.
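How overall accuracy hides costly false negatives can be seen with a tiny numeric sketch (pure Python; the labels and counts are made up for illustration): a trivial classifier that always predicts the majority class scores 97% accuracy while missing every wilted tree.

```python
def confusion_counts(y_true, y_pred, positive="wilted"):
    """Return (tp, fn, fp, tn) for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

y_true = ["wilted"] * 3 + ["healthy"] * 97
y_pred = ["healthy"] * 100            # always predict the majority class
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)    # 0.97, yet all 3 wilted trees are missed
```

Under a cost model where a false negative (a missed wilted tree) is far more expensive than a false positive, this 97%-accurate classifier is the worst possible choice, which is exactly why error-rate minimization alone is inadequate here.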