IEEE TRANSACTIONS ON CYBERNETICS

Weighted Data Gravitation Classification for Standard and Imbalanced Data

Alberto Cano, Member, IEEE, Amelia Zafra, Member, IEEE, and Sebastián Ventura, Senior Member, IEEE

Abstract—Gravitation is a fundamental interaction whose concept and effects, applied to data classification, give rise to a novel classification technique. The simple principle of data gravitation classification (DGC) is to classify data samples by comparing the gravitation between the different classes. However, the calculation of gravitation is not a trivial problem, owing to the different relevance of data attributes for distance computation, the presence of noisy or irrelevant attributes, and the class imbalance problem. This paper presents a gravitation-based classification algorithm which improves previous gravitation models and overcomes some of their issues. The proposed algorithm, called DGC+, employs a matrix of weights to describe the importance of each attribute in the classification of each class, which is used to weight the distance between data samples. It improves classification performance by considering both global and local data information, especially at decision boundaries. The proposal is evaluated and compared with other well-known instance-based classification techniques on 35 standard and 44 imbalanced data sets. The results of these experiments show the strong performance of the proposed gravitation model, and they are validated using several nonparametric statistical tests.

Index Terms—Classification, covariance matrix adaptation evolution strategy (CMA-ES), data gravitation, evolutionary strategies, imbalanced data.

I. INTRODUCTION

Supervised learning is one of the most fundamental tasks in machine learning. A supervised learning algorithm analyzes a set of training examples and produces an inferred function to predict the correct output for any other examples.
Classification is a common task in supervised machine learning which aims at predicting the correct class for a given example. Classification has been successfully implemented using many different paradigms and techniques, such as artificial neural networks [1], support vector machines (SVMs) [2], instance-based learning methods [3], or nature-inspired techniques such as genetic programming [4].

The nearest neighbor (NN) algorithm [5] is an instance-based method which might be the simplest classification algorithm. Its classification principle is to classify a new sample with the class of the closest training sample. The extension of NN to k neighbors (KNN) and its derivatives are among the most influential data mining techniques, and they have been shown to perform well in many domains [6]. However, the main problem with these methods is that they deteriorate severely with noisy or high-dimensional data: they become very slow, and their accuracy tends to degrade as the dimensionality increases, especially when classes are nonseparable or overlap [7]. In recent years, new instance-based methods based on data gravitation classification (DGC) have been proposed to solve the aforementioned problems of the NN classifiers [8]–[10].

Manuscript received May 24, 2012; revised September 5, 2012; accepted October 25, 2012. This work was supported by the Regional Government of Andalusia and the Ministry of Science and Technology, projects P08-TIC-3720 and TIN-2011-22408, FEDER funds, and Ministry of Education FPU grant AP2010-0042. This paper was recommended by Editor J. Basak. The authors are with the Department of Computer Science and Numerical Analysis, University of Cordoba, 14071 Cordoba, Spain (e-mail: acano@uco.es; azafra@uco.es; sventura@uco.es). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCB.2012.2227470
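The 1-NN classification rule described above can be sketched in a few lines; this is a minimal illustrative implementation (not code from the paper), using unweighted Euclidean distance:

```python
import numpy as np

def nn_classify(X_train, y_train, x):
    """1-NN rule: predict the class of the closest training sample."""
    # Euclidean distance from the query x to every training sample.
    dists = np.linalg.norm(X_train - x, axis=1)
    # The label of the nearest sample is the prediction.
    return y_train[int(np.argmin(dists))]
```

Because the rule treats every attribute equally, irrelevant or noisy attributes distort the distances, which is precisely the weakness that weighted-distance approaches such as DGC+ address.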
DGC models are inspired by Newton's law of universal gravitation and simulate the accumulative attractive force between data samples to perform the classification. These gravitation-based classification methods extend the NN concept to the law of gravitation among objects in the physical world. The basic principle of DGC is to classify data samples by comparing the data gravitation among the training samples of the different data classes, whereas KNN votes among the k training samples that are closest in the feature space.

This paper presents a DGC algorithm (DGC+) that compares the gravitational field of the different data classes and predicts the class with the highest magnitude. The proposal improves previous data gravitation algorithms by learning the optimal weights of the attributes for each class, and it solves some of their issues, such as nominal attribute handling, imbalanced data performance, and noisy data filtering. The weights of the attributes in the classification of each class are learned by means of the covariance matrix adaptation evolution strategy (CMA-ES) [11], a well-known, robust, and scalable global stochastic optimizer for difficult nonlinear and nonconvex continuous-domain objective functions [12]. The proposal improves accuracy by considering both global and local data information, especially at decision boundaries.

The experiments have been carried out on 35 standard and 44 imbalanced data sets collected from the KEEL [13] and UCI [14] repositories. The algorithms compared, for both standard and imbalanced classification, have been selected from the KEEL [15] and WEKA [16] software tools. The experiments consider different problem domains and numbers of instances, attributes, and classes. The algorithms in the experiments include some of the most relevant instance-based and imbalanced classification techniques presented to date.
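The gravitation-based decision rule with per-class attribute weights can be sketched as follows. This is a simplified illustration of the principle, not the actual DGC+ algorithm: it assumes unit masses and an inverse-square attraction, and the `weights` dictionary is a hypothetical stand-in for the weight matrix that DGC+ learns with CMA-ES:

```python
import numpy as np

def class_gravitation(X_train, y_train, x, weights, eps=1e-12):
    """Per-class gravitational field at query point x.

    weights maps each class label to a per-attribute weight vector
    (here fixed by hand; DGC+ learns such weights via CMA-ES).
    """
    field = {}
    for c in weights:
        Xc = X_train[y_train == c]
        # Weighted squared distance from x to each sample of class c.
        d2 = np.sum(weights[c] * (Xc - x) ** 2, axis=1)
        # Inverse-square attraction accumulated over the class;
        # eps avoids division by zero when x coincides with a sample.
        field[c] = float(np.sum(1.0 / (d2 + eps)))
    return field

def dgc_classify(X_train, y_train, x, weights):
    """Predict the class whose gravitational field at x is strongest."""
    field = class_gravitation(X_train, y_train, x, weights)
    return max(field, key=field.get)
```

Note how every training sample contributes to the field (global information) while nearby samples dominate through the inverse-square term (local information), which is the intuition behind the behavior at decision boundaries.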
The reported results show the competitive performance of the proposal, which obtains significantly better results in terms of predictive accuracy, Cohen's kappa rate [17], [18], and area under the curve (AUC) [19], [20]. The experimental