International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 06 | June 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 4153

A Novel Approach of Text Document Clustering by using Clustering Techniques

Sumit Mayani 1, Saket Swarndeep 2
1 Student of Masters of Engineering, Ahmedabad, Dept. of Computer Engineering, L. J. Institute of Engineering & Technology, Gujarat, India
2 Head of Department, Ahmedabad, Dept. of Computer Engineering, L. J. Institute of Engineering & Technology, Gujarat, India

Abstract - Clustering is one of the most important unsupervised data analysis techniques; it divides data objects into clusters based on similarity and supports the summarization of datasets. Clustering has been studied and applied in many different fields, including pattern recognition, advanced data mining, computational data science, machine learning, and information retrieval. This research focuses on text documents that contain similar words. A combination of two methods, the improved k-means algorithm and the traditional k-means algorithm, is used to improve the quality of the initial cluster centres.

Key Words: Text Clustering, K-means, Clustering Text Document, Text similarity.

1. INTRODUCTION

Clustering is an important data analysis technique that divides data objects into clusters based on similarity, so that each cluster contains objects that are similar to the other objects within the same cluster [7]. Nowadays the amount of data on the internet is increasing dramatically day by day, and clustering is considered an important data mining technique for categorizing, summarizing, and classifying text documents.
Data mining is the extraction of meaningful information from large datasets. Data mining techniques span many fields, such as text mining, information extraction, document organization, and information retrieval. Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. Data clustering refers to an unsupervised learning technique that offers refined and more abstract views of the inherent structure of a data set by partitioning it into a number of disjoint or overlapping (fuzzy) groups. Clustering refers to the natural grouping of data objects in such a way that the objects in the same group are more similar to one another than to the objects in other groups. Document clustering is an important research direction in text mining, which aims to apply clustering algorithms to textual data so that text documents can be organized, summarized, and retrieved in an efficient way [7].

There are broadly three types of clustering, namely hierarchical clustering, density-based clustering, and partition-based clustering. Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. There are two types of hierarchical clustering methods: divisive and agglomerative. The divisive method is a top-down clustering method: it assigns all observations to a single cluster and then partitions that cluster into the two least similar clusters. The agglomerative method is a bottom-up clustering method: it starts with each observation in its own cluster, then computes the similarity between each pair of clusters and joins the two most similar clusters.

Partitional clustering algorithms obtain k clusters from a set of data points without any hierarchical structure. Each cluster contains at least one object, and each object belongs to exactly one cluster. Clustering methods are used to classify the observations within a data set into multiple groups based on their similarity.
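
The partitional scheme described above can be illustrated with a minimal sketch of the traditional k-means iteration. This is an illustration under stated assumptions, not the method proposed in this paper: the function name `kmeans`, the helper `sq_dist`, the use of squared Euclidean distance, the explicit `init_centres` argument, and the sample points are all hypothetical choices made for the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, init_centres, max_iter=100):
    """Partition `points` into len(init_centres) clusters.

    Assumption: the caller supplies the initial centres, since the result
    of k-means depends on them.
    """
    centres = list(init_centres)
    k = len(centres)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point is placed in exactly one cluster,
        # the one whose centre is nearest.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centres[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the partition is stable
        assignment = new_assignment
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centre if a cluster has no members
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centres

# Hypothetical sample data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centres = kmeans(points, init_centres=[(1.0, 1.0), (8.0, 8.0)])
print(labels)
```

Passing the initial centres explicitly keeps the sketch deterministic and makes visible the input that the choice of starting centres contributes to the final partition.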
Partitional clustering includes algorithms such as k-means and k-medoids (PAM, partitioning around medoids).

Text mining is the procedure of synthesizing information by analyzing the relations, the patterns, and the rules among textual data, whether semi-structured or unstructured text.

Why text mining? A massive amount of new information is being created, 80-90% of all data is held in various unstructured formats, and useful information can be derived from this unstructured data.

Fig -1: Text Mining Process [28]
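
As a rough illustration of how unstructured text can be turned into something a clustering algorithm can compare, the sketch below converts documents into weighted term vectors and measures their similarity. It is not taken from this paper: the TF-IDF weighting with a smoothed IDF, the cosine similarity measure, the helper names `tokenize`, `tfidf_vectors`, and `cosine`, and the sample documents are all assumptions introduced for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of the letters a-z as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
        vectors.append(
            {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical sample documents: the first two share several words.
docs = [
    "Clustering groups similar text documents",
    "Document clustering organizes similar text",
    "Stock prices fell sharply on Monday",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(sim_related > sim_unrelated)
```

Because no stemming is applied in this sketch, "document" and "documents" count as different terms; a fuller pipeline would normally add such preprocessing before clustering.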