Evolving Systems
https://doi.org/10.1007/s12530-019-09292-7

ORIGINAL PAPER

Employing query disambiguation using clustering techniques

Andreas Kanavos 1 · Panagiota Kotoula 1 · Christos Makris 1 · Lazaros Iliadis 2

Received: 27 November 2018 / Accepted: 3 July 2019
© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract
Due to the boundless expansion of the Web in the last decade, the research community has paid significant attention to the problem of effective searching in the vast amount of available information. In this paper, we introduce a novel framework for improving information retrieval results. Initially, relevant documents are organized in clusters utilizing several metrics combined with language modelling tools. Subsequently, a ranked list of the documents is returned to the user for a specific query. To produce this list, the scores between the clusters and the query representation are first extracted; the internal rankings of the documents within each cluster are then combined, using these scores as weighting factors. Our proposed methodology is based on the exploitation of inter-document similarities (lexical and/or semantic) after a sophisticated pre-processing step. Our experimental evaluation demonstrates that the proposed algorithm can efficiently improve the quality of the retrieved results.

Keywords Query disambiguation · Information retrieval · Query reformulation · Clustering · Containment · Semantics

1 Introduction

Search engines constitute tools of inestimable value for retrieving information from the Web. However, they are not efficient at presenting the results of ambiguous queries, which usually return web page references that map to different meanings and are mixed together in the answer list.
More specifically, extracting knowledge and grouping the results returned by a search engine into groups, or into a hierarchy of labelled clusters, is a very important task that modern search engines have recently started taking into consideration (footnote 1). With category-clustered results, the user may focus on a general topic by entering a generic query and then selecting the results that better match their interest.

Improving the quality of ranking in Information Retrieval results is one of the most popular research issues. In this setting, an information need is expressed as a query submitted to a search engine or platform, with the purpose of receiving any available information related to the query (Baeza-Yates and Ribeiro-Neto 2011; Manning et al. 2008). The challenge in this process is the capability of the search engine to respond and then deliver the fittest set of information for the specific query, if this information actually exists.

On the other hand, users who post queries often lack the corresponding experience and thus cannot be expected to know the best format in which to phrase their input. Potential reasons are that they cannot express their intention clearly or that they do not leverage the full potential of the search platform. The search engine's greatest challenge is then to understand the users' intention from this given input, namely the query itself; that is, to disambiguate the terms that compose the query and attempt to satisfy the request.

A preliminary version of this paper was presented at the 14th International Conference on Artificial Intelligence Applications and Innovations, AIAI 2018, Rhodes, Greece, May 25–27, 2018.
* Andreas Kanavos, kanavos@ceid.upatras.gr
Panagiota Kotoula, kotoula@ceid.upatras.gr
Christos Makris, makri@ceid.upatras.gr
Lazaros Iliadis, liliadis@civil.duth.gr

Affiliations:
1 Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece
2 Department of Civil Engineering, Democritus University of Thrace, 67100 Xanthi, Greece

Footnote 1: Google: https://www.google.com/search/about/.