Journal of Theoretical and Applied Information Technology
10th February 2014. Vol. 60 No.1
© 2005 - 2014 JATIT & LLS. All rights reserved.
ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195
A NOVEL APPROACH FOR TEXT CLUSTERING USING
MUST LINK AND CANNOT LINK ALGORITHM
1 J. DAFNI ROSE, 2 DIVYA D. DEV, 3 C. R. RENE ROBIN
1 St. Joseph’s Institute of Technology, Department of Computer Science and Engineering, Chennai-119
2 St. Joseph’s College of Engineering, Department of Computer Science and Engineering, Chennai-119
3 Jerusalem College of Engineering, Department of Computer Science and Engineering, Chennai-100
E-mail: 1 jdafnirose@yahoo.co.in, 2 divyaddev@gmail.com, 3 crrenerobin@gmail.com
ABSTRACT
Text clustering groups documents that are highly similar to one another. It has found applications in
different areas of text mining and information retrieval. The volume of digital data available today has
grown enormously, and retrieving useful information from it is a major challenge. Text clustering is an
important means of organizing such data and extracting useful information from the available corpus. In
this paper, we propose a novel method for clustering text documents. In the first phase, features
are selected using a genetic-based method. In the next phase, the extracted keywords are clustered using a
hybrid algorithm, and the clusters are classed under meaningful topics. The MLCL algorithm works in three
phases. First, the linked keywords produced by the genetic-based extraction method are identified with a
Must Link and Cannot Link (MLCL) algorithm. Second, the MLCL algorithm forms the initial clusters. Finally,
the clusters are optimized using Gaussian parameters. The proposed method is tested on datasets such as
Reuters-21578 and the Brown Corpus. The experimental results indicate that the proposed method
performs better than the fuzzy self-constructing feature clustering algorithm.
Keywords: Genetic Algorithm, Keyword Extraction, Text Clustering, MLCL Algorithm.
1. INTRODUCTION
Text mining is an important process in the field of
information retrieval [1]. It comprises
a wide range of processes such as text clustering,
classification, text summarization and automatic
organization of text documents. The number of documents
available on the internet is increasing day by day,
and most of them are loosely structured. Clustering
has become a significant and widely used text
mining tool for structuring these documents so that
similar documents are placed in the same group
and dissimilar documents are separated into
different groups [17].
Text clustering is an unsupervised learning method
in which similar documents are grouped into clusters.
It is defined as a method of finding groups of similar
objects in the data. The similarity between objects
is calculated using various similarity functions.
Clustering can be very useful in various text
domains, where the objects to be clustered may be
paragraphs, sentences,
documents or terms. Clustering helps to organize
documents, which in turn helps to improve
information retrieval and support browsing [4].
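The paper does not say at this point which similarity function it uses. As a minimal illustrative sketch, assuming cosine similarity over raw term-frequency vectors and naive whitespace tokenization (both are assumptions made here, not the authors' stated choices), the similarity between two text objects could be computed as follows. The function name `cosine_similarity` is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two texts using raw term frequencies.

    Assumption: lowercased whitespace tokens stand in for real
    preprocessing (stop-word removal, stemming, and so on).
    """
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())

    # Dot product over the terms that both documents share.
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())

    # Euclidean norms of the two term-frequency vectors.
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))

    # An empty document has no direction; treat it as dissimilar to everything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("text clustering groups documents",
                            "text mining organizes documents"))
```

The same function applies unchanged whether the objects are sentences, paragraphs or whole documents, since each is reduced to a bag of terms before comparison.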
The quality of any text mining method, such as
classification or clustering, is highly dependent on
the noisiness of the features used for the
process. Therefore, the features should be selected
effectively to improve the clustering quality. Some
of the commonly used feature selection methods are
document frequency based selection, term
strength and entropy based ranking. After the
features have been selected, any text mining task
such as classification, clustering or summarization
can be applied [4].
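Of the feature selection methods listed above, document frequency based selection is the simplest to illustrate. The following is a rough sketch only, not the genetic-based method proposed in this paper: it keeps terms that occur in at least `min_df` documents and in no more than a `max_df_ratio` fraction of them. The function name, both thresholds and the whitespace tokenization are assumptions chosen for illustration.

```python
from collections import Counter


def select_by_document_frequency(documents, min_df=2, max_df_ratio=0.9):
    """Return the terms whose document frequency lies within the given bounds.

    min_df and max_df_ratio are hypothetical thresholds: very rare terms
    are treated as noise, and near-ubiquitous terms as uninformative.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term at most once per document.
        doc_freq.update(set(doc.lower().split()))

    return sorted(
        term
        for term, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )


if __name__ == "__main__":
    corpus = [
        "the cat sat",
        "the dog sat",
        "the bird flew",
        "the fish swam",
    ]
    # "the" occurs in every document and is dropped; terms seen once are dropped.
    print(select_by_document_frequency(corpus))
```

Pruning both ends of the frequency range reduces dimensionality before any clustering step, which is the motivation the paragraph above gives for selecting features first.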
The basic characteristics of text documents include
high dimensionality, sparsity and noisy features.
These properties influence the performance of
clustering algorithms [17]. Text clustering
finds numerous applications in customer
segmentation, classification, visualization,
document organization and indexing.
The two main classifications of clustering
algorithms are hierarchical based and K-means