Toward An Improved Concept-Based Information Retrieval System¹

Peter V. Henstock, Daniel J. Pack², Young-Suk Lee, and Clifford J. Weinstein
MIT Lincoln Laboratory
244 Wood St.
Lexington, MA 02420-9185

ABSTRACT
This paper presents a novel information retrieval system that includes 1) the addition of concepts to facilitate the identification of the correct word sense, 2) a natural language query interface, 3) the inclusion of weights and penalties for proper nouns that build upon the Okapi weighting scheme, and 4) a term clustering technique that exploits the spatial proximity of search terms in a document to further improve the performance. The effectiveness of the system is validated by experimental results.

Keywords
Brill Tagger, Concept, Information Retrieval, Okapi, Roget's Thesaurus, Word Sense Disambiguation, WordNet.

1. DESCRIPTION OF THE IR SYSTEM
The information retrieval (IR) system described in this paper is part of the CCLINC system developed by MIT Lincoln Laboratory [1,6,7]. It is based on the Okapi [5] system from TREC-3 with enhancements that include 1) the addition of concepts to facilitate the identification of the correct word sense, 2) a natural language interface, 3) the inclusion of weights and penalties for proper nouns within the Okapi framework, and 4) a term clustering technique that exploits the spatial proximity of search terms in a document. The system was evaluated on a set of English documents from a domain-specific corpus.

Since identifying the correct sense of a word is of fundamental importance to achieving good IR performance [4], our system assigns concepts that represent the word sense or semantic meaning to all query and corpus words. The concepts were derived from the 1911 edition of Roget's Thesaurus that contained approximately 1000 categories. Roget's semantic framework was selected since it contains a large number of nouns, verbs, adjectives, and adverbs within a unified categorization across parts of speech (POS).
The concepts were supplemented by domain-specific and technological entries, as well as entries for words in other POS categories, for a total of 1050 concepts. The lexicon included words from Roget's Thesaurus, WordNet [3], and various word lists, for a total of over 200,000 words.

The interface to the IR system is based on natural language queries, as shown in the query understanding block in Figure 1. Each query is parsed to form a semantic frame from which both the query words and their associated concepts are extracted. The natural language queries not only provide a friendly interface but, when combined with the semantic frame, also ensure the accurate concept labeling that is crucial for determining the correct word senses, especially in IR.

The documents in the corpus are processed through a sequence of steps that assign a concept to each word. The steps include special handling of idiomatic expressions and morphological stemming of all adjectives, nouns, and verbs. Each word is then labeled with a concept using the Brill tagger [2]. Although this tagger is commonly used only for assigning a small set of POS tags, we have demonstrated over 92% accuracy in assigning the 1050 concept tags for domain-specific documents. Furthermore, the majority of the tagging errors occur with "have", "be", "go", "may", and "do", which are in the stop list and consequently have no bearing on IR performance. The assignment of concepts represents a form of shallow natural language understanding that requires more processing than word matching alone, but significantly less computation than full parsing.

Figure 1. Block Diagram of the Query Processing and IR

For the Frequency Weighted Search module of the IR block shown in Figure 1, the Okapi algorithm was selected as the baseline since it has demonstrated consistently strong performance. The Okapi equation, which consists of the sum of frequency-weighted query words, was extended to include the sum of the frequency-weighted concepts.
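The extension described above can be illustrated with a minimal Python sketch in which words and concept tags are simply distinct entries in the same term list, so each contributes its own frequency-weighted summand. This is not the paper's equation: the BM25-style form of the Okapi weight, the non-negative idf variant, the parameter values k1 and b, and the "C:" concept tag names are all assumptions made for illustration.

```python
import math
from collections import Counter


def okapi_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document with a BM25-style Okapi sum over query terms.

    Each document is a list of terms. Words and concept tags are both
    plain entries in that list, so the score is the sum of the
    frequency-weighted words plus the frequency-weighted concepts.
    k1 and b are assumed values, not taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(t for d in docs for t in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in set(query_terms):
            if tf[term] == 0:
                continue
            n = doc_freq[term]
            # Non-negative idf variant (an assumption for this sketch).
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical tagged corpus: each word is followed by its concept tag.
docs = [
    ["bank", "C:MONEY", "loan", "C:DEBT"],
    ["bank", "C:LAND", "river", "C:WATER"],
    ["loan", "C:DEBT", "rate", "C:QUANTITY"],
]

words_only = okapi_scores(["bank"], docs)
words_and_concepts = okapi_scores(["bank", "C:MONEY"], docs)
```

With the word alone, the two documents containing "bank" receive the same score; adding the concept term separates the financial sense from the river sense, which is the effect the concept terms are meant to have.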
[Figure 1 block labels: Input Query; Semantic Frame Parsing; Concept Categories; Freq. Weighted Search; Search on Tagged Corpus; Database Access; Proximity Search; ScoreA; ScoreB. Enclosing blocks: Query Understanding; Information Retrieval.]

________________________________
¹ This work was sponsored by the Defense Advanced Research Projects Agency under Air Force contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense.
² Daniel J. Pack is an associate professor of Electrical Engineering at the United States Air Force Academy, on sabbatical leave.

Copyright 2001 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by a contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
SIGIR '01, September 9-12, 2001, New Orleans, Louisiana, USA.
Copyright 2001 ACM 1-58113-331-6/01/0009…$5.00.

Experiments on combining words and concepts showed that treating each as a separate term within the Okapi framework gave the best results, as shown in Equation 1. Applying a multiplicative weight to the existing Okapi weight for proper nouns further improved the overall performance. In addition, applying a penalty (penalty A) to documents lacking the proper noun terms contained in the query proved effective for reducing the sensitivity of the performance to