DOI: 10.4018/IJeC.2020070105
International Journal of e-Collaboration
Volume 16 • Issue 3 • July-September 2020
Copyright © 2020, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
A New Hybrid Document Clustering
for PRF-Based Automatic Query
Expansion Approach for Effective IR
Yogesh Gupta, BML Munjal University, Haryana, India
Ashish Saini, Dayalbagh Educational Institute, India
ABSTRACT
Automatic query expansion (AQE) is an effective measure to improve information retrieval
performance by including additional terms in a user query. The pseudo relevance feedback (PRF)
method commonly employed for AQE suffers from a major problem: query drift. To address this,
the present article proposes a new hybrid document clustering approach for PRF-based AQE,
in which fuzzy logic and Particle Swarm Optimization (PSO) are used to construct document
clusters. Further, a new hybrid PSO and fuzzy logic-based term weighting approach
is used to find more suitable additional query terms by maximizing a weighted score of four IR
evidences. Moreover, a combined semantic filtering method and query term re-weighting
algorithms are used to remove noisy or irrelevant terms. The
approaches presented in this article are tested and compared with other approaches
on three benchmark data sets. The comparative analysis of all the tested approaches shows the
superior performance of the proposed approach.
KEYWORDS
Automatic Query Expansion, Document Clustering, F-Measure, Fuzzy Logic, Particle Swarm Optimization,
Precision, Pseudo Relevance Feedback, Recall
INTRODUCTION
Pseudo relevance feedback (PRF) based automatic query expansion (AQE) methods (Attar et al., 1977; Buckley
et al., 1995; Lavrenko et al., 2001; Robertson et al., 1996) rest on the assumption that the
top retrieved documents are relevant and can therefore supply suitable expansion terms. In practice,
however, the top retrieved documents of any Information Retrieval (IR) model may
contain noise (Gupta et al., 2017). This noise may cause query expansion to “drift” away from the
original query. Another problem in IR is the size of datasets. Datasets are growing
exponentially, and extracting the relevant documents from these huge datasets has become a
challenging task. These problems may be overcome by document clustering. Clustering algorithms
are unsupervised learning tools that partition documents into clusters such that similar
documents (objects) are grouped into the same cluster. In this way, the search space for retrieving
relevant documents is reduced. The top retrieved documents after clustering contain less noise
than those used in PRF-based query expansion over un-clustered documents. Therefore, a new hybrid
document clustering and PRF-based AQE approach is proposed in this paper for text document retrieval. An