Cognitive Computation (2019) 11:271–293
https://doi.org/10.1007/s12559-018-9611-8
Automatic Scientific Document Clustering Using Self-organized
Multi-objective Differential Evolution
Naveen Saini¹ · Sriparna Saha¹ · Pushpak Bhattacharyya¹
Received: 6 April 2018 / Accepted: 12 November 2018 / Published online: 19 December 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract
Document clustering is the partitioning of a given collection of documents into K groups based on some
similarity/dissimilarity criterion. This task has applications in scope detection for journals and conferences, development of
automated peer-review support systems, topic modeling, recent cognitive-inspired work on text summarization, and
semantics-based classification of documents. In the current paper, a cognitive-inspired multi-objective automatic
document clustering technique is proposed that fuses a self-organizing map (SOM) with a multi-objective differential
evolution approach. A variable number of cluster centers is encoded in the different solutions of the population so that
the number of clusters in a data set is determined automatically. These solutions undergo various genetic operations during
evolution. The concept of SOM is utilized in designing new genetic operators for the proposed clustering technique.
To measure the goodness of a clustering solution, two cluster validity indices, the Pakhira-Bandyopadhyay-Maulik
index and the Silhouette index, are optimized simultaneously. The effectiveness of the proposed approach, namely the
self-organizing map based multi-objective document clustering technique (SMODoc clust), is shown in the automatic classification
of some scientific articles and web documents. Different representation schemes, including tf, tf-idf, and word embeddings, are
employed to convert articles into vector form. Comparative results with respect to internal cluster validity indices, namely the
Dunn index and the Davies-Bouldin index, are shown against several state-of-the-art clustering techniques, including three
multi-objective clustering techniques (MOCK, VAMOSA, and NSGA-II-Clust), a single-objective genetic algorithm (SOGA)
based clustering technique, K-means, and single-linkage clustering. The results obtained show that our approach performs better
than the existing approaches. The obtained results are also validated using statistical significance t-tests.
Keywords Clustering · Cluster validity indices · Self Organizing Map (SOM) · Differential Evolution (DE) · Polynomial
mutation · Multi-objective Optimization (MOO)
Naveen Saini
naveen.pcs16@iitp.ac.in; naveen.pcs16@gmail.com
Sriparna Saha
sriparna@iitp.ac.in
Pushpak Bhattacharyya
pb@iitp.ac.in
¹ Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, 801103 Bihar, India
Introduction
Background
Document clustering [1] refers to the partitioning of a given
collection of documents into K groups based
on some similarity/dissimilarity criterion so that each
document in a group is similar to other documents in the
same group. Applications of document clustering
include extraction of relevant topics [12], organization of
documents, as in digital libraries [63], creation of document
taxonomies [22], as in Yahoo, and document summarization
[25]. For the purpose of clustering, the value of K
may or may not be known a priori. To determine the value
of K for a collection of documents, traditional clustering
approaches [44] such as K-means [31], bisecting K-means [59], and
hierarchical clustering techniques [31] must be
executed multiple times with different values of K. The
quality of each resulting partitioning is measured with respect
to some cluster validity indices, which quantify the goodness of
a partitioning in terms of different intrinsic properties
of its clusters. Finally, the partitioning which corresponds to
the optimal value of a cluster validity index is selected