International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 06 | June 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 4153
A Novel Approach of Text Document Clustering by using
Clustering Techniques
Sumit Mayani¹, Saket Swarndeep²
¹Student of Masters of Engineering, Ahmedabad, Dept. of Computer Engineering,
L. J. Institute of Engineering & Technology, Gujarat, India
²Head of Department, Ahmedabad, Dept. of Computer Engineering,
L. J. Institute of Engineering & Technology, Gujarat, India
---------------------------------------------------------------------***------------------------------------------------------------------
Abstract - Clustering is one of the most important
unsupervised data analysis techniques. It divides data
objects into clusters based on similarity and supports the
summarization of datasets. Clustering has been studied and
applied in many different fields, including pattern recognition,
advanced data mining, computational data science, machine
learning, and information retrieval. This research focuses on
text documents that contain similar words. It combines two
methods, an improved k-means algorithm and the traditional
k-means algorithm, to improve the quality of the initial cluster
centres.
Key Words: Text Clustering, K-means, Clustering Text
Document, Text similarity.
1. INTRODUCTION
Clustering is an important data analysis technique that
divides data objects into clusters based on similarity, so that
each cluster contains objects that are similar to the other
objects within the same cluster [7]. Because the amount of data
on the internet is increasing dramatically day by day,
clustering is considered an important data mining technique
for categorizing, summarizing, and classifying text documents.
Data mining is the extraction of meaningful information
from large datasets, and its techniques span many fields,
such as text mining, information extraction, document
organization, and information retrieval. It analyzes data from
different perspectives and summarizes it into useful
information.
Data clustering refers to an unsupervised learning
technique that offers refined and more abstract views of
the inherent structure of a data set by partitioning it into a
number of disjoint or overlapping (fuzzy) groups. Clustering
refers to the natural grouping of data objects in such a
way that the objects in the same group are more similar to
one another than to the objects in other groups. Document
clustering is an important research direction in text mining,
which aims to apply clustering algorithms to textual
data so that text documents can be organized,
summarized, and retrieved efficiently [7]. There are
broadly three types of clustering: hierarchical
clustering, density-based clustering, and partition-based
clustering.
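To make the idea of grouping documents "based on similarity" concrete, the following is a minimal sketch of one common way to compare two text documents. It assumes a bag-of-words representation with raw term counts and cosine similarity as the measure. Neither choice is stated in the text above, and the helper names `term_counts` and `cosine_similarity` and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, for illustration only.
docs = [
    "clustering groups similar documents",
    "document clustering groups similar text",
    "the stock market fell sharply today",
]
vectors = [term_counts(d) for d in docs]

# The first two documents share three terms; the third shares none.
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

A clustering algorithm would use such pairwise scores (or the corresponding distances) to decide which documents belong in the same group. Note that this sketch does no stemming, so "document" and "documents" count as different terms.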
Hierarchical clustering involves creating clusters that have
a predetermined ordering from top to bottom. There are
two types of hierarchical clustering method: divisive and
agglomerative. The divisive method is a top-down clustering
method: it assigns all observations to a single cluster and then
repeatedly partitions a cluster into its two least similar
clusters. The agglomerative method is a bottom-up clustering
method: it starts with each observation in its own cluster, then
computes the similarity between each pair of clusters and joins
the two most similar clusters. Partitional clustering
algorithms obtain k clusters from a set of data points
without any hierarchical structure. Each cluster contains at
least one object, and each object belongs to exactly one
cluster. These methods are used to classify the observations
within a data set into multiple groups based on their
similarity. Partitional clustering algorithms include
k-means and k-medoids, or PAM (partitioning around
medoids).
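As an illustration of partitional clustering, here is a minimal sketch of the standard (Lloyd-style) k-means iteration in plain Python. It is not the improved method proposed in this paper. The function name `kmeans`, the Euclidean distance, the sample points, and the choice to pass the initial centres in explicitly are all assumptions made for the example.

```python
import math


def kmeans(points, centres, max_iter=100):
    """Lloyd-style k-means on tuples of floats, from given initial centres."""
    centres = [tuple(c) for c in centres]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        labels = [
            min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j, old in enumerate(centres):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(x) / len(members) for x in zip(*members))
                new_centres.append(mean)
            else:
                new_centres.append(old)  # keep an empty cluster's centre in place
        if new_centres == centres:
            break  # converged: no centre moved
        centres = new_centres
    return labels, centres


# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centres = kmeans(points, centres=[points[0], points[3]])
print(labels)
print(centres)
```

Because the initial centres are a parameter, the same routine can be started from different seeds, which is where the quality of the initial cluster centres affects the final partition. This sketch requires Python 3.8 or later for `math.dist`.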
Text mining is the procedure of synthesizing information by
analyzing the relations, patterns, and rules among textual
data, whether semi-structured or unstructured. Why text mining?
A massive amount of new information is being created, and
80-90% of all data is held in various unstructured formats.
Useful information can be derived from this unstructured data.
Fig -1: Text Mining Process [28]