International Journal of Science and Research (IJSR) ISSN: 2319-7064
ResearchGate Impact Factor (2018): 0.28 | SJIF (2019): 7.583
Volume 9 Issue 5, May 2020
www.ijsr.net
Licensed Under Creative Commons Attribution CC BY

A Review of Big Data Clustering Methods and Research Issues

Nweso Emmanuel Nwogbaga

Department of Networking and Communication, Faculty of Computer Science, University Putra Malaysia
Department of Computer Science, Faculty of Science, Ebonyi State University, Nigeria

Abstract: Data mining is a method for knowledge discovery from a dataset. The world today is moving toward being data-driven in all ramifications, ranging from education, health care, security, and customer management to smart cities. Unsupervised learning, such as clustering, is the most widely used big-data mining technique for grouping a large dataset when there is no prior information about the classes in the dataset. The use of the Internet of Things (wearables, sensors, RFID) and social networks has drastically increased data in the cyber-physical world, resulting in what is called Big Data. The growth of big data, driven in part by cloud computing, has proliferated research on knowledge discovery from this avalanche of data. Clustering is used to extract valuable hidden information from massive, complex data. Clustering, as unsupervised learning, has an advantage over supervised learning when it comes to knowledge discovery in a huge dataset without prior knowledge of the groups. In this review, we discuss big data mining techniques and narrow the discussion to clustering methods. We also discuss different clustering approaches and the similarity measures used in clustering algorithms. Finally, we discuss the strengths and weaknesses of clustering approaches and the research issues in clustering big data for information discovery.

Keywords: Big Data, Big Data Mining, Clustering, IoT Big Data Clustering, Distance/Similarity Measures, Unsupervised Learning

1. Introduction

The world is moving toward data-driven decision making (Provost & Fawcett, 2013). This implies basing our decisions and actions on the data available around us, for almost everything we do. Data are generated from our environment in several ways today, ranging from sensors, cameras, and Internet-traffic stream data to other devices. These devices and platforms generate large volumes of text, image, audio, and video data, and these different modalities of data result in what is known as Big Data. The benefits of cloud computing, as discussed in (Nwogbaga, 2016), have also proliferated data generation, because users care less about processing resources. The Internet of Things (IoT) is another source of data generation; the use of IoT in vehicular networks, as presented in (Eze, Sijing, Liu, Nwogbaga, & Eze, 2016; Eze, Zhang, Liu, Nwogbaga, & Eze, 2016), generates a huge amount of data worldwide daily.

Big data imposes the challenge of identifying the underlying patterns, groups, or hidden information in a dataset. Analysis of big data requires efficient data mining techniques and poses many challenges in processing and analytics (C. C. Aggarwal, 2015). Big data mining through these devices involves different stages (Che, Safran, & Peng, 2013). The characteristics of big data are: high volume; different types of data (variety); data quality issues, ranging from incomplete data to noisy data (veracity); and collection at high speed (velocity) (J. Chen et al., 2013; Kuang et al., 2014).

1.1 Big Data

The term "Big Data" was first used in 1998, according to (Kitchin & McArdle, 2016; Thakur & Mann, 2014), by John Mashey in a Silicon Graphics (SGI) slide deck. Big Data refers to datasets that are beyond the ability of current data processing technology (J. Chen et al., 2013; Riahi & Riahi, 2018). Big data plays a critical role in all areas of human endeavour.
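The label-free grouping described in the introduction can be made concrete with a toy sketch. The following is an illustrative, minimal k-means implementation, one of the partitioning-style clustering algorithms a review like this surveys; the function name, the deterministic initialization, and the sample points are assumptions for demonstration only, not code from any cited work.

```python
# Minimal k-means sketch: group 2-D points into k clusters with no
# prior class labels (illustrative only; deterministic init for clarity).

def kmeans(points, k, iters=20):
    # Initialize centroids from the first k points (an assumed, simple scheme).
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid,
        # using squared Euclidean distance as the similarity measure.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids

data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),   # one natural group
        (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]   # another natural group
labels, centroids = kmeans(data, k=2)
print(labels)  # points from the same natural group share a label
```

The two assignment/update steps are the core of every partitioning clustering method; production systems differ mainly in initialization, distance measure, and how the steps are parallelized over large data.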
For instance, governments now mine the contents of social media networks, blogs, and other online transactions to identify the need for government facilities, to recognize organizational groups and their activities, and to predict relevant future events such as threats or promises. Service providers, on the other hand, track their customers' purchases made online and in-store, as well as customer behaviour through streams of online clicks, to improve their marketing, predict the growth of their profits, and increase customer satisfaction (Che et al., 2013).

The gap between the demands of big data management and the capabilities of current DBMSs has reached a historic peak. Each of the three major characteristics of big data (Volume, Variety, and Velocity) exposes a distinct deficiency of present DBMSs. Large volume requires great scalability and massive parallelism that are beyond the capability of present DBMSs; the high variety of data types in big data is mostly not compatible with the architecture of current database systems; and the velocity of big data, especially stream data processing, requires real-time efficiency that is far beyond what current DBMSs provide (Madden, 2012).

1.2 Characteristics of Big Data

Big data is presently a volatile term with different definitions from different perspectives (De Mauro, Greco, & Grimaldi, 2016; Ylijoki & Porras, 2016), and different authors characterize it accordingly. (Hadi, Lawey, El-gorashi, & Elmirghani, 2018; Russom, 2011) characterized big data with 3Vs: Volume, Variety, and Velocity. Others, such as (Mao, Hu, & Kumar, 2018), characterized big data with 4Vs: Volume, Variety, Velocity, and Veracity. (Sami & Sael, 2016) characterized big data with 9Vs: Volume, Variety, Velocity, Veracity, Validity, Volatility, Variability, Visualization, and Value, as in Table 1.

Paper ID: SR20502183559 DOI: 10.21275/SR20502183559
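The velocity deficiency discussed in Section 1.1, where stream data must be answered in real time rather than stored and queried later, can be sketched with a single-pass computation. This is an illustrative example only (the function name and sensor values are assumptions, not from the paper): a running mean is updated incrementally with O(1) state per arriving record, instead of the store-first, query-later model of a batch DBMS.

```python
# Velocity sketch: maintain an up-to-date statistic over an unbounded
# stream in one pass, keeping constant state instead of storing records.

def streaming_mean(stream):
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        # Incremental update: earlier records need not be retained.
        mean += (value - mean) / count
        yield mean  # a current answer is available after every arrival

readings = [10.0, 12.0, 11.0, 13.0]    # e.g. sensor values arriving over time
print(list(streaming_mean(readings)))  # [10.0, 11.0, 11.0, 11.5]
```

The same incremental pattern underlies stream-oriented clustering algorithms, which update cluster summaries per arriving record rather than re-scanning the full dataset.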