International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.10, No.4, July 2020
DOI:10.5121/ijdkp.2020.10401

EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER DATA

Parth Patel 1, Kalpdrum Passi 1 and Chakresh Kumar Jain 2

1 Department of Mathematics and Computer Science, Laurentian University, Sudbury, Ontario, Canada
2 Department of Biotechnology, Jaypee Institute of Information Technology, Noida, India

ABSTRACT

Over the past few years, microarray technology has spread considerably in the study of many biological patterns, particularly those pertaining to cancers such as leukemia, prostate cancer, and colon cancer. The primary bottleneck in properly understanding such datasets lies in their dimensionality, so studying them efficiently and effectively requires a substantial reduction in their dimension. This study suggests different algorithms and approaches for reducing the dimensionality of such microarray datasets. It exploits the matrix-like structure of microarray data and uses a popular technique called Non-Negative Matrix Factorization (NMF) to reduce the dimensionality, primarily in the field of biological data. Classification accuracies are then compared for these algorithms. This technique gives an accuracy of 98%.

KEYWORDS

Microarray datasets, Feature Extraction, Feature Selection, Principal Component Analysis, Non-negative Matrix Factorization, Machine learning.

1. INTRODUCTION

There has been exponential growth in the amount and quality of biologically inspired data sourced from numerous experiments conducted across the world. If properly interpreted and analyzed, these data can be the key to solving complex problems related to healthcare.
One important and very widely analyzed class of biological data is DNA microarray data, produced by a commonly used technology for genome-wide expression profiling [1]. Microarray data is stored as a matrix in which each row represents a gene and each column represents a sample, so each element gives the expression level of a gene in a sample [2]. Gene expression is pivotal in explaining most biological processes. Any change in it can therefore alter the normal functioning of the body in many ways, and such changes are key to mutations [3]. Studying DNA microarray data can thus be a potential method for identifying many ailments in human beings that are otherwise hard to detect. However, because of the large size of these datasets, a complete analysis of microarray data is very complex [4]. Initial pre-processing steps are therefore required to reduce the dimension of the datasets without losing essential information.
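To make the matrix layout and the idea of dimensionality reduction concrete, the following is a minimal sketch, not the procedure used in this study. It assumes a synthetic non-negative genes-by-samples matrix in place of real microarray data, and it uses the standard multiplicative-update rules for the Frobenius-norm NMF objective as one possible algorithm. The function name `nmf_multiplicative`, the matrix sizes, and the rank `k=5` are hypothetical choices made only for illustration.

```python
import numpy as np


def nmf_multiplicative(V, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (genes x samples) as W @ H.

    Illustrative sketch using multiplicative updates for the
    Frobenius-norm objective; an assumed algorithm, not the paper's code.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = V.shape
    # Random strictly positive initial factors.
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_samples)) + eps
    errors = []
    for _ in range(n_iter):
        # Element-wise updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors


# Synthetic stand-in for a microarray matrix (assumption):
# 500 genes (rows) x 20 samples (columns), all entries non-negative.
rng = np.random.default_rng(42)
V = rng.random((500, 20))

W, H, errors = nmf_multiplicative(V, k=5)

# Each sample is now described by 5 values instead of 500 gene levels.
sample_features = H.T
```

With genes as rows and samples as columns, each column of `H` is a low-dimensional description of one sample, so `sample_features` could be passed to a classifier in place of the full expression profile.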