Copyright © 2018 Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
International Journal of Engineering & Technology, 7 (4.36) (2018) 147-153
Website: www.sciencepubco.com/index.php/IJET
Research paper
A Survey on Clustering Density Based Data Stream algorithms
Mayas Aljibawi*, Mohd Zakree Ahmed Nazri, Zalinda Othman
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia 43600 Bangi,
Selangor Darul Ehsan, Malaysia
*Corresponding author E-mail: mayasaljibawi@gmail.com
Abstract
With the rapid evolution of technology, the volume of data has grown as well. This growth raises new challenges for pattern discovery, such as
limited memory and time and the need to process the whole data in a single pass. Many clustering techniques have been developed to overcome these
issues. Streaming data evolve over time, which makes it almost impossible to fix the number of clusters in advance. Density-based algorithms
are one of the significant classes of data clustering that address this issue, because they do not require advance knowledge of the number of
clusters. This paper reviews some of the existing density-based clustering algorithms for data streams, together with the measures used to
evaluate them.
Keywords: data mining, clustering, density-based clustering, grid-based clustering, micro-clustering, stream data clustering.
1. Introduction
Rapid developments in technology have made the volume of data
collected from various sources very large. For example, the genome
of a single human being can occupy up to 4 gigabytes of storage [1],
and the amount of data created every day reaches up to 2.5
quintillion bytes [2]. Further large volumes of data are generated
continuously as streams by different applications.
Stream data mining, which refers to extracting knowledge
structures from a stream, is attracting many researchers because
of the growth in data stream generation and the importance of its
applications [3]. Traditional approaches to data analysis are no
longer suitable for such massive amounts of new data.
Therefore, new approaches are needed to extract the important
information from these data, with robust techniques for
examining and explaining the data to obtain the relevant knowledge
that assists decision making.
2. Data mining and data clustering
2.1 Data mining
Data mining is the process of extracting previously unknown,
relevant patterns, through tasks such as the detection of unusual
records (anomaly detection), cluster analysis, and dependency
discovery [4, 5]. Several definitions of data mining from the
literature are given below.
[6] defines data mining as the approach of finding essential
connections and patterns by moving through data stored in a
repository. [4] describes it as the process of examining voluminous
data stored in a database, seeking patterns and associations within
that data. [7] gives another definition of data mining as the
process of selecting, exploring, and modeling huge amounts of data
to discover previously unknown patterns for business advantage.
2.2 Data clustering
Clustering is one of the most suitable techniques for distributing
data into groups of similar objects that are closely related to one
another and different from the objects of other groups. Clustering
approaches arrange a set of patterns into groups, or clusters, on
the basis of similarity measures. Clustering techniques follow an
unsupervised approach, in which unlabeled data items are grouped
into valid clusters [4, 5], whereas in supervised approaches
the dataset is given as a pre-classified (labeled) item set. When
the dataset is already labeled, the existing labels help to label new items.
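As an illustration of grouping unlabeled items purely on the basis of a similarity measure, the following minimal Python sketch links any two points whose Euclidean distance is at most a threshold and treats each connected group as one cluster. It is not one of the algorithms surveyed in this paper; the function name `group_by_distance`, the threshold `eps`, and the sample points are assumptions made only for this example. Note that the number of clusters is not supplied in advance.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def group_by_distance(points, eps):
    """Give each point a cluster id; points within eps are linked transitively.

    Hypothetical helper for illustration only, not a method from the paper.
    """
    labels = [None] * len(points)
    next_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to an earlier cluster
        labels[i] = next_id
        stack = [i]
        # Expand the cluster by following links to unlabeled neighbours.
        while stack:
            j = stack.pop()
            for k in range(len(points)):
                if labels[k] is None and dist(points[j], points[k]) <= eps:
                    labels[k] = next_id
                    stack.append(k)
        next_id += 1
    return labels


# Five unlabeled 2-D points (made-up data) forming two well-separated groups.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.4), (5.0, 5.0), (5.3, 5.1)]
labels = group_by_distance(points, eps=1.0)
print(labels)
```

The sketch compares every pair of points and keeps all of them in memory, so it suits only small static datasets; it is meant to show similarity-based grouping, not the single-pass processing that data streams require.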
Figure 1: Data mining steps
• Clustering: the process of partitioning the data points into
smaller groups. Each of the formed groups represents a cluster
whose objects are similar to each other and dissimilar to the
objects of other clusters. The result of this process is
referred to as a clustering [3].
• Requirements for Cluster Analysis
➢ Scalability: many algorithms in the literature handle small
datasets well, whereas databases nowadays contain millions of
objects, which makes high scalability a must for a clustering algorithm.
➢ Handling different types of attributes: algorithms are
normally developed to deal with one type of data (numeric, binary,
nominal, etc.). However, many applications now require
clustering algorithms for complex types of data.
➢ Discovering clusters with different shapes: clustering
algorithms usually measure distance with either the Euclidean or
the Manhattan metric, and as a result they tend to find spherical
clusters of similar size and density.
However, clusters can take various shapes (e.g.