A Cluster Based Analysis for Imbalanced Data using SMOTE and Cluster-Based Classification

Arbind Kumar Chaurasia 1, Sohit Agarwal 2
1 Research Scholar (M.Tech-CSE), 2 Assistant Professor
1,2 Department of Computer Science and Engineering, Suresh Gyan Vihar University, Jaipur-302017

Abstract: Data repositories are growing tremendously because organizations such as governments, corporations, and healthcare providers generate data in large amounts. Large amounts of data are being produced, processed, collected, and analysed online, so this data needs to be transformed into valuable information. The process of extracting knowledge from large amounts of data is referred to as data mining. The proposed hybrid approach can be evaluated with different classifiers such as Naïve Bayes and random forest. In the proposed methodology we find that the SMOTE algorithm, which uses the k-nearest-neighbour algorithm, draws on only some of the minority class instances when creating synthetic samples, which sometimes leads to overfitting, so a more effective oversampling approach can be developed.

Keywords: Cluster, Classification, Imbalanced data, Analysis, Prediction

I. INTRODUCTION

Most data sets in the real world are imbalanced. Imbalance occurs when the target class is not distributed equally across its class levels. Classifying such data is one of the hardest problems in machine learning and has recently become quite important. Most popular machine learning algorithms are designed to maximize overall accuracy, which is the percentage of correct predictions made by a classifier. On imbalanced data, a classifier can reach high overall accuracy by favouring the majority class, which results in very low sensitivity for the positive class. The best approach is therefore not to concentrate on overall accuracy but to optimize the sensitivities of the positive and negative classes separately.
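The gap between overall accuracy and per-class sensitivity can be illustrated with a small sketch. The 95/5 class split, the always-negative predictor, and the helper names below are illustrative assumptions, not data or code from this study. The sketch also assumes that the positive class is the minority class.

```python
# Illustrative sketch: accuracy vs. per-class sensitivity on imbalanced labels.
# The data and the trivial predictor are hypothetical, chosen only to show the effect.

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred == positive:
            fp += 1
        elif truth != positive and pred != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def safe_ratio(numerator, denominator):
    """Avoid division by zero when a class is absent."""
    return numerator / denominator if denominator else 0.0

# Assumed toy labels: 5 positive (minority) and 95 negative (majority) instances.
y_true = [1] * 5 + [0] * 95
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = safe_ratio(tp + tn, len(y_true))
sensitivity_positive = safe_ratio(tp, tp + fn)  # recall on the positive class
sensitivity_negative = safe_ratio(tn, tn + fp)  # recall on the negative class (specificity)

print(f"accuracy={accuracy:.2f}")
print(f"sensitivity (positive)={sensitivity_positive:.2f}")
print(f"sensitivity (negative)={sensitivity_negative:.2f}")
```

In this toy case the classifier never detects a positive instance, yet its overall accuracy is high. This is why the two sensitivities are reported separately.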
To overcome this problem, several methods have been developed. Sampling methods alter the prior distribution of the minority and majority classes to obtain a balanced class distribution in the training data. Sampling techniques can be divided into basic sampling methods and advanced methods. Basic sampling techniques include random undersampling of the majority class (RUS), random oversampling of the minority class (ROS), and a combination of both. With random oversampling of the minority data, however, certain minority instances may become over-represented, so a model trained on such data is highly prone to overfitting. In

Mukt Shabd Journal, Volume IX, Issue VI, June 2020, ISSN: 2347-3150, Page No: 3375