International Journal of Computer Applications (0975 - 8887), Volume 184, No. 9, April 2022

Reduce Noise in K-Mean Clustering using DBSCAN Algorithm

Manjur Ahammad, Dept. of Computer Science & Engineering, United International University, Dhaka-1212, Bangladesh
Faija Juhin, Dept. of Computer Science & Engineering, United International University, Dhaka-1212, Bangladesh
Dewan Md. Farid, Dept. of Computer Science & Engineering, United International University, Dhaka-1212, Bangladesh

ABSTRACT
The use of data mining is growing day by day, as it allows us to extract useful insights from data. New techniques and tools for mining data are introduced constantly, and many research papers are written based on the knowledge gained from these insights. Data are clustered into different groups based on their behaviour, patterns, and characteristics. Clustering such massive amounts of data requires different types of algorithms and techniques. The most common clustering approaches are partitioning, hierarchical, grid-based, and model-based algorithms. Other algorithms for handling these data include K-means clustering, density-based algorithms, and similarity-based algorithms; each has a different purpose. Some perform well for nominal data, some for categorical or ordinal data, and some can remove duplicate or noisy data while others cannot. This paper presents a method for clustering a dataset and removing the noise of that dataset at the same time.

General Terms
Algorithm

Keywords
Big Data, K-Means Clustering, DBSCAN, OPTICS

1. INTRODUCTION
Data mining is the process of extracting and discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. Not every method, however, is suitable for every dataset.
The high-volume streaming data produced every minute by applications of the Internet of Things (IoT) plays a vital role in everyone's life. From this tremendous amount of data, patterns, behaviours, and characteristics can be extracted [1], [2]. Storing such huge amounts of data remains a challenging and time-consuming task. These data are sometimes called networking data because of the dependencies between one data item and another. They can be used in various fields such as biotechnology, machine learning, and IoT [3]. Clustering algorithms are used to handle these large amounts of data: based on the similarity and dissimilarity of the data, they are grouped for analysis and for finding their hidden behaviour [4]. For example, someone researching the marketing of a particular product can learn how frequently consumers buy that product and whether it can hold its popularity among buyers; such hidden patterns help people make business decisions. Various types of clustering algorithms exist, such as K-means clustering, similarity-based clustering, density-based clustering, distance-based clustering, and hierarchical clustering. K-means clustering is one of the most popular of these methods. The K-means algorithm computes centroids and repeats until it finds optimal centroids; data points are allocated to clusters so that the sum of the squared distances between the data points and their centroids is minimised. However, K-means clustering is time-consuming because of the computational steps of the algorithm, and it cannot remove duplicate or noisy data.
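The assignment and update steps described above can be sketched as follows. This is a minimal illustration of the standard Lloyd's-style K-means on 2-D points; the example dataset and the deterministic first-k initialisation are illustrative assumptions, not part of the paper's proposed method.

```python
def kmeans(points, k, max_iter=100):
    """Minimal 2-D K-means (Lloyd's algorithm): assign each point to its
    nearest centroid, then recompute each centroid as its cluster mean."""
    # Deterministic initialisation for illustration only; real
    # implementations use random or k-means++ seeding.
    centroids = list(points[:k])
    for _ in range(max_iter):
        # Assignment step: each point goes to the centroid that
        # minimises the squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            d2 = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d2.index(min(d2))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for c, cl in zip(centroids, clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centroids, clusters = kmeans(pts, k=2)
```

Note that every point, including an outlier, is always assigned to some cluster, which illustrates why K-means by itself cannot mark noisy points the way a density-based method such as DBSCAN can.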
In this paper, the authors share an idea for clustering a dataset and removing noisy data at the same time in K-means clustering. The paper is organised as follows: the second section describes the challenges of Big Data clustering; the third section provides an overview of various clustering techniques; the fourth section presents the proposed idea; finally, the fifth section concludes the paper.

2. BIG DATA CLUSTERING CHALLENGES
Data on the web is growing at an unbelievable rate as the number of internet users grows, and with new data sources the complexity of data is also increasing. This huge streaming data falls into three categories: structured data, unstructured data, and semi-structured data. Big data is defined by the 6Vs:
Volume: the size of the data is growing in the blink of an eye; today it exceeds terabytes and petabytes.
Variety: the types of data, which can be anything such as audio, video, images, or written documents.
Velocity: the speed at which data is generated and processed.
Value: the information a person can obtain from the data.
Veracity: the trustworthiness or reliability of the data.
Validation: whether the purpose of the data has been served or not.
Clustering these massive data into their respective groups is not an easy task. The potential challenges have been identified as follows:
The identification of a distance measure: the Euclidean, Manhattan, and maximum distance measures can be used for numerical attributes, but for categorical attributes choosing a distance measure is problematic.
Lack of class labels: the distribution of data has to be done to