TRANSACTIONAL PROCESSING SYSTEMS

A New Data Preparation Method Based on Clustering Algorithms for Diagnosis Systems of Heart and Diabetes Diseases

Nihat Yilmaz & Onur Inan & Mustafa Serter Uzer

Received: 23 October 2013 / Accepted: 27 March 2014 / Published online: 16 April 2014
© Springer Science+Business Media New York 2014

Abstract The most important factors that prevent pattern recognition from functioning rapidly and effectively are noisy and inconsistent data in databases. This article presents a new data preparation method based on clustering algorithms for the diagnosis of heart and diabetes diseases. In this method, a new modified K-means algorithm is used in a clustering-based data preparation system to eliminate noisy and inconsistent data, and Support Vector Machines are used for classification. The newly developed approach was tested in the diagnosis of heart diseases and diabetes, which are prevalent within society and figure among the leading causes of death. The data sets used in the diagnosis of these diseases are the Statlog (Heart), the SPECT images and the Pima Indians Diabetes data sets obtained from the UCI database. The proposed system achieved classification success rates of 97.87 %, 98.18 % and 96.71 % on these data sets, respectively. Classification accuracies for these data sets were obtained using the 10-fold cross-validation method. According to the results, the performance of the proposed method is highly successful compared with other reported results, and it seems very promising for pattern recognition applications.

Keywords Heart and Diabetes diseases · Support Vector Machine · Modified K-means Algorithm

Introduction

Pattern recognition and data mining are used in many aspects of life. These techniques are most frequently used in military, medical and industrial areas. The variety and quantity of data collected and used in these applications have increased considerably thanks to the contribution of new measurement systems.
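The data preparation idea summarized in the abstract (clustering the records and discarding those that look noisy or inconsistent before classification) can be illustrated with a minimal sketch. The modified K-means algorithm itself is not specified in this part of the article, so the sketch substitutes plain k-means and a simple majority-label rule: a record is dropped when its class label disagrees with the majority label of its cluster. The function names `kmeans` and `filter_inconsistent`, the deterministic centroid initialization and the toy data are all assumptions made for illustration only.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means (not the article's modified variant).

    Centroids start at k evenly spaced records, an arbitrary but
    deterministic choice made only for this sketch.
    """
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    assign = None
    for _ in range(iters):
        # Assignment step: each record goes to its nearest centroid.
        new_assign = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assign == assign:
            break  # converged
        assign = new_assign
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def filter_inconsistent(points, labels, k):
    """Return indices of records whose label matches the majority label
    of the cluster they fall into (assumed consistency rule)."""
    assign = kmeans(points, k)
    majority = {}
    for c in set(assign):
        in_cluster = [l for l, a in zip(labels, assign) if a == c]
        majority[c] = Counter(in_cluster).most_common(1)[0][0]
    return [i for i, (l, a) in enumerate(zip(labels, assign)) if l == majority[a]]


# Toy data: two well-separated groups; record 3 carries the "wrong" label.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.0)]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

kept = filter_inconsistent(points, labels, k=2)
print(kept)  # [0, 1, 2, 4, 5, 6, 7]
```

In the pipeline described by the abstract, the retained records would then be passed to a Support Vector Machine and evaluated with 10-fold cross-validation; that stage is omitted here to keep the sketch dependency-free.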
This article is part of the Topical Collection on Transactional Processing Systems

N. Yilmaz (*) · M. S. Uzer
Electrical-Electronics Engineering Department, Engineering Faculty, Selcuk University, Konya, Turkey
e-mail: nyilmaz@selcuk.edu.tr

M. S. Uzer
e-mail: msuzer@selcuk.edu.tr

O. Inan
Computer Engineering Department, Engineering Faculty, Selcuk University, Konya, Turkey
e-mail: oinan@selcuk.edu.tr

J Med Syst (2014) 38:48
DOI 10.1007/s10916-014-0048-7

It has become nearly impossible for experts to analyze and evaluate these data sets in order to obtain information that is ultimately useful. The quality of the data is the most important aspect, as it influences the quality of the results of the analysis. The data should be carefully collected, integrated, characterized, and prepared for analysis. For this reason, feature selection and data reduction algorithms are being developed; these increase the performance of analysis or recognition systems by filtering and ordering data according to importance and by identifying unnecessary measurements in data sets [1, 2].

In the algorithm that we developed, the k-means algorithm is used as an instrument for determining which data are coherent or incoherent with one another. The data reduction operation is carried out by a heuristic algorithm that runs according to the incoherence information obtained from the k-means algorithm. Some notable works in the literature that use the k-means algorithm for data reduction are as follows. Patil et al. [3] removed the incorrectly classified data from the original data set by means of the simple k-means algorithm on Type-2 diabetes data, and classified the remaining data with the C4.5 algorithm using the k-fold cross-validation method. Patil et al. [4] likewise divided breast and diabetes data into two sets, correctly and incorrectly classified, using the simple k-means algorithm and extracted the data in the set that