Using Tag-Neighbors for Query Expansion in Medical Information Retrieval

Frederico Durao, Karunakar Bayyapu, Guandong Xu, Peter Dolog, Ricardo Lage
Department of Computer Science, Aalborg University
Selma Lagerlöfs Vej 300
Email: {fred,kreddy,xu,dolog,ricardol}@cs.aau.dk

Abstract: In the context of medical document retrieval, users' often under-specified queries lead to undesired search results that lack the information they seek, match domain knowledge inadequately, or come from unreliable sources. To overcome the limitations of under-specified queries, we utilize tags to enhance information retrieval capabilities by expanding users' original queries with context-relevant information. We compute a set of significant tag neighbor candidates based on neighbor frequency and weight, and use the most frequent and most heavily weighted neighbors to expand an entry query whose terms match tags. The proposed approach is evaluated using the MedWorm medical article collection and standard evaluation methods from the Text REtrieval Conference (TREC). Against a baseline Mean Average Precision (MAP) of 0.353, query expansion reaches a MAP of 0.491 (+39%). In-depth analysis shows how this strategy is beneficial at different ranks of the retrieval results.

I. INTRODUCTION

In the context of medical document retrieval, users' often under-specified queries lead to undesired search results that lack the information they seek, match domain knowledge inadequately, or come from unreliable sources. For instance, when a user wants to search the web for a recent outbreak of influenza, a search with the query "influenza" will return a list of documents containing the query term, ranked by a set of criteria defined by the search engine. In this case, at least three issues may affect the quality of the search result.
First, a query with only one or two terms may be under-specified; that is, it may not contain enough terms for the search engine to retrieve the information the user wants. Second, the search engine's document repository may hold hundreds of thousands of articles matching the query. With that much information, it is impossible to locate the desired information by simply browsing through all of the returned results. The third issue relates to domain knowledge requirements. Because conventional search engines focus on generic information search, domain-specific results are usually not taken into consideration during the search. Thus, a simple word-based search does not produce relevant search results in specific domains such as the medical domain [1].

As a consequence of these issues with query-based searches, only one fourth to one half of the relevant articles on a given topic are retrieved in searches performed in specific domains [2]. In other words, sparse and incomplete query terms may cause information overload by increasing the noise present in search results. Refining the query therefore becomes more important in such scenarios.

To overcome the limitations of under-specified queries, we utilize tag neighbors to enhance information retrieval capabilities by expanding the user's original query. Tags are free-style terms used as annotations that indicate the user's own perceptions or conceptual judgments about the tagged resources. We focus on medical document collections, e.g. PubMed¹ and MedWorm², because when searching these collections it is often desirable to retrieve only those documents pertaining to a specific medical area. To this end, the tags users give to the documents in the collection are typically related to the domain(s) each user is interested in. That is, users are able to choose their own free-style terms (i.e.
tags), which are associated with the domain(s) of their interest. The purpose of query expansion is to bridge the gap between the queries users enter and the relevant documents to be retrieved.

In a nutshell, we compute a set of significant tag neighbor candidates based on tag neighbor frequency and weight, and use the most frequent and most heavily weighted tag neighbors to expand an entry query whose terms match tags. For instance, if a user submits the query "influenza", our method automatically maps the query to the high-frequency tag neighbor term "contagious". The search is thus refined by retrieving documents whose contents contain the words "influenza" and "contagious". Furthermore, neighbor terms are also searchable. Taking the previous query as an example, documents indexed with medical terms that include the word "influenza" (e.g. "influenza contagious viral") will also be returned, depending on the neighbor frequency and weight.

In this paper, the expansion terms are selected from a large number of tags provided by users. We propose the tag-neighbors method for selecting high-frequency terms. With this method, we try to choose good expansion terms from the candidate neighbors according to their potential impact on retrieval effectiveness. We implement our method in a search system with contents extracted and indexed from the MedWorm medical article database.

¹ www.ncbi.nlm.nih.gov/pubmed
² www.medworm.com

978-1-4244-9224-4/11/$26.00 ©2011 IEEE

We