(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 12, 2020
www.ijacsa.thesai.org

Applications of Clustering Techniques in Data Mining: A Comparative Study

Muhammad Faizan1, Megat F. Zuhairi2*, Shahrinaz Ismail3, Sara Sultan4
Malaysian Institute of Information Technology, Universiti Kuala Lumpur, Kuala Lumpur, Malaysia1, 2, 3
College of Computing and Information Sciences, Karachi Institute of Economics and Technology, Karachi, Pakistan4

Abstract—In modern scientific research, data analysis is widely used across computer science, communication science, and biological science. Clustering plays a significant role in data analysis. Clustering, recognized as an essential problem in unsupervised learning, deals with segmenting the structure of data in an unknown region and is the basis for further understanding. Among the many clustering algorithms (more than 100 are known), the K-means clustering algorithm is commonly used because of its simplicity and rapid convergence. This paper explains the different applications, literature, challenges, methodologies, and considerations of clustering methods, and the related key objectives for implementing clustering with big data. It also presents one of the most common clustering techniques for identifying data patterns by performing an analysis of sample data.

Keywords—Clustering; data analysis; data mining; unsupervised learning; k-means; algorithms

I. INTRODUCTION

Data mining is a recent interdisciplinary field of computational science. It is the process of discovering interesting information from large amounts of data stored in data warehouses, databases, or other information repositories, and of automatically discovering data patterns in massive databases [1], [2]. Data mining refers to the extraction or “mining” of valuable information from large data volumes [3], [4].
Nowadays, people come across a massive amount of information and store or represent it as datasets [4], [5]. Process discovery is the learning task of constructing process models from the event logs of information systems [6]. Fascinating insights, observable behaviours, or high-level information can be extracted from a database by data mining and viewed or browsed from various angles. The knowledge discovered can be applied to process control, decision making, information management, and query handling. Decision-makers can use these methods to make clearer decisions on real-world problems. In data mining, many data clustering techniques are used to trace a particular data pattern [2]. The data mining methods are shown in Fig. 1.

Clustering techniques are useful meta-learning tools for analyzing the knowledge produced by modern applications. Clustering algorithms are used extensively not only for organizing and categorizing data but also for data modelling and data compression [7]. The purpose of clustering is to classify data into groups according to similarities, traits, characteristics, and behaviours [8]. Evaluating data clusters is an essential activity in knowledge discovery and data mining. Clustering can be carried out in an unsupervised, semi-supervised, or supervised manner [2]. However, more than 100 clustering algorithms are known, and selecting among them to obtain better results is challenging.

PyClustering is an open-source library for data mining written in Python and C++, providing a wide variety of clustering methods and algorithms, including bio-inspired oscillatory networks. PyClustering focuses primarily on cluster analysis to make it more user-friendly and understandable.
Many methods and algorithms are in the C++ namespace “ccore::clst” and in the Python module “pyclustering.cluster”. Some of the algorithms and their availability in the PyClustering module are listed in Table I [9].

A. Clustering in Data Mining

Data volumes continue to expand exponentially in various scientific and industrial sectors, and automated categorization techniques have become standard tools for data set exploration [10]. Automatic categorization techniques, traditionally called clustering, help to reveal a dataset's structure [9]. Clustering is a well-established unsupervised data mining method [11], and it deals with the discovery of structure in a collection of unlabeled data. The overall process followed when developing an unsupervised learning solution is summarized in the chart in Fig. 2.

Fig. 1. Methods of Data Mining Techniques.

*Corresponding Author
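
As an illustration only of how clustering groups unlabeled data by similarity, the following minimal Python sketch implements the basic K-means loop: assign each point to its nearest center, then recompute each center as the mean of its members. The function name `kmeans`, the toy data, and the choice of initial centers are assumptions made for this example; they are not taken from the paper or from PyClustering.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Plain K-means on tuples of floats (illustrative sketch).

    `points` and the initial `centers` are hypothetical inputs chosen
    for this example; they do not come from the paper.
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return labels, centers


# Toy, made-up data: two well-separated groups in the plane.
sample = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centers = kmeans(sample, centers=[sample[0], sample[3]])
```

On this toy data the first three points end up in one cluster and the last three in the other. In practice the initial centers are usually chosen randomly or with a seeding heuristic, and the result can depend on that choice.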