This is the author's version of an article that has been published in this conference. Changes were made to this version by the publisher prior to publication. The final version of record is available at 10.1109/HPCC/SmartCity/DSS.2019.00397. Please cite as: Susan, Seba, and Amitesh Kumar. "Learning Data Space Transformation Matrix from Pruned Imbalanced Datasets for Nearest Neighbor Classification." In 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 2831-2838. IEEE, 2019.

Learning Data Space Transformation Matrix from Pruned Imbalanced Datasets for Nearest Neighbor Classification

Seba Susan, Amitesh Kumar
Department of Information Technology, Delhi Technological University, Delhi, India-110042, Email: seba_406@yahoo.in

Abstract— The nearest neighbor classifier is deemed to be the litmus test for the worst-case scenario and has often been relied on by data mining researchers to test the robustness of their algorithms. However, not all datasets are suited for distance-based classification, and a prior transformation of the data space generally helps. This paper proposes a novel hybrid sampling with data space transformation that boosts the performance of the nearest neighbor classifier for imbalanced datasets. The SSOMaj-SMOTE-SSOMin three-step pruning and resampling technique, introduced in a recent work by the authors, is used in the first stage of our experiments to achieve a balance between the under-represented minority and over-represented majority class. In the second stage, the transformation matrix learnt from the pruned dataset is used to transform the pruned training distribution and the test sample space. The proposed method thus enforces the spatial arrangement of the sampled training dataset onto the test sample space.
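The SMOTE step of the resampling stage generates synthetic minority samples by interpolating between a minority point and one of its minority-class nearest neighbors. A minimal sketch of this interpolation follows; the function name and NumPy-based implementation are our illustrative assumptions, not the authors' code:

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between
    randomly chosen minority points and their k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude each point from its own neighbors
    k = min(k, n - 1)
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                # random minority seed point
        j = neighbors[i, rng.integers(k)]  # one of its minority-class neighbors
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Each synthetic point lies on the line segment between two existing minority samples, so the oversampled class stays within the original minority region of the feature space.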
A variety of popular data space transformation techniques are investigated for the application. The consequences of transforming the original training space, based on the transformation learnt from the pruned training set, are also investigated. Experiments on benchmark datasets, with comparison to the state-of-the-art, demonstrate the superiority of our learning approach for imbalanced datasets.

Keywords—Imbalanced datasets; Nearest Neighbor classifier; Data Space Transformation; SSOMaj-SMOTE-SSOMin

I. INTRODUCTION

The k-Nearest Neighbor (kNN) classifier is one of the earliest classification methods; it classifies patterns based on their distances from labeled pre-existing patterns [3]. The advantages of the kNN classifier are its non-linear decision boundary, which makes no presumptions about class distributions, and its single parameter, which is tuned during cross-validation [6]. The Euclidean distance is usually used. A majority vote of class labels among the k nearest neighbors (labeled examples) decides the class of the test sample. The value of k for a particular application is usually determined empirically. A value of k=1 gives the nearest neighbor classifier, a popular worst-case test of the robustness of the training samples. Distance-based classification is still a reliable option for computer vision experiments [9, 10] and information retrieval (IR) [11, 12], wherein distance-based similarity measures are computed for retrieving the top-K matches to the training template. Fuzzy classification techniques, too, often translate distance similarity values into fuzzy memberships and fuzzy-based decision-making [13, 14]. Like all other classifiers, kNN is affected by the imbalance in class representation prevalent in real-world datasets today, and some variants of kNN that can handle imbalance between classes have been proposed [15].
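The kNN decision rule described above can be sketched in a few lines. This is a generic NumPy illustration (the function name is ours), not the authors' implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=1):
    """Classify x_test by majority vote among its k nearest
    training samples under the Euclidean distance."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority class label
```

With k=1 this reduces to the nearest neighbor classifier used throughout the paper as the worst-case robustness test.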
Due to the insufficient number of samples from the minority category, the k nearest neighbors of a minority test instance may include several members of the other class, which may sway the decision on the class label of the test instance. To overcome this problem and to increase the efficiency of the nearest neighbor classifier for imbalanced datasets, a new scheme of data space transformation of sampled and pruned datasets is proposed in this paper.

The organization of this paper is as follows. The data space transformation techniques are reviewed in Section II, the methodology of hybrid sampling with data space transformation is described in Section III, the results are analyzed in Section IV and the conclusions are drawn in Section V.

II. DATA SPACE TRANSFORMATION - A REVIEW

In order to explain the various data space transformation techniques, we start with a set of labeled training data $x_i \in \mathbb{R}^d$ having corresponding class labels $y_i,\; i = 1, \ldots, n$. The task is to learn a transformation matrix from the labeled instances that would improve kNN classification. Here, we review several techniques that are part of our experimentation.

A. Local Fisher Discriminant Analysis

The Local Fisher Discriminant Analysis (LFDA) proposed by Sugiyama in [4] defines a transformation matrix that brings nearby data pairs within the same class closer together, and pulls apart data pairs from different classes even if their values were actually close. Technically, this is interpreted as reducing the within-class scatter $S^{(w)}$ and increasing the between-class scatter $S^{(b)}$. The LFDA transformation matrix is therefore defined as

$$T_{LFDA} = \arg\max_{T} \; \mathrm{tr}\!\left( \left( T S^{(w)} T^{\top} \right)^{-1} T S^{(b)} T^{\top} \right) \qquad (1)$$
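As a simplified illustration of learning such a transformation, the sketch below solves the ordinary (global) Fisher discriminant objective, the non-local analogue of Eq. (1): it builds the within-class and between-class scatter matrices and takes the top generalized eigenvectors of $S^{(b)}$ with respect to $S^{(w)}$. Note that full LFDA additionally weights both scatter matrices by a local affinity between sample pairs, which is omitted here; the function name and the small ridge term are our assumptions:

```python
import numpy as np

def fisher_transform(X, y, n_components):
    """Learn a linear transform maximizing between-class scatter
    relative to within-class scatter (plain FDA; LFDA would add
    locality weights to both scatter matrices)."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    S_w = np.zeros((d, d))   # within-class scatter S^(w)
    S_b = np.zeros((d, d))   # between-class scatter S^(b)
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    # generalized eigenproblem S_b v = lambda S_w v,
    # with a small ridge on S_w for numerical stability
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w + 1e-8 * np.eye(d), S_b))
    order = np.argsort(eigvals.real)[::-1]          # largest eigenvalues first
    T = eigvecs.real[:, order[:n_components]].T     # rows span the learnt subspace
    return T  # transform data as X @ T.T
```

Projecting both the pruned training set and the test samples through the learnt matrix places them in the transformed space in which the nearest neighbor search is then performed.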