IIT TREC-2007 Genomics Track: Using Concept-based Semantics in Context for Genomics Literature Passage Retrieval

Jay Urbain
Information Retrieval Laboratory
Computer Science Department
Illinois Institute of Technology
Chicago, IL
urbajay@iit.edu

Nazli Goharian
Information Retrieval Laboratory
Computer Science Department
Illinois Institute of Technology
Chicago, IL
goharian@iit.edu

Ophir Frieder
Information Retrieval Laboratory
Computer Science Department
Illinois Institute of Technology
Chicago, IL
frieder@iit.edu

Abstract

For the TREC-2007 Genomics Track [1], we explore unsupervised techniques for extracting semantic information about biomedical concepts, together with a retrieval model that uses these semantics in context to improve passage retrieval precision. Dependency grammar analysis is evaluated for boosting the rank of passages where complementary subject/object concept pairs can be identified between queries and sentences from candidate passages. In our model, a concept is represented as a set of synonymous terms and a concept-word distribution. Concept terms are identified using an information extraction technique relying on shallow sentence parsing, external knowledge sources, and document context. The system combines a dimensional data model for indexing scientific literature at multiple levels of document context with a rule-based query processing algorithm. The data model consists of two hierarchical indices: one for individual words and a second for extracted concepts. The word index provides retrieval of single or multi-word terms. The concept index provides efficient retrieval of single or multiple independent concepts. A retrieval function combines concepts with term statistics at multiple levels of context to identify relevant passages.
Finally, we use dependency grammar analysis to boost the relevance score of sentences within a passage where we can identify term dependencies that complement subject/object pairs between query and passage sentences. Our objective for this year’s forum was to improve passage retrieval precision. We submitted three automatically generated result sets to the TREC forum, one for each of three variations of our retrieval model. All three exceeded the track median for character-based passage retrieval by 75 to 93%. The mean average precision (MAP) for our top passage retrieval model was 0.0940, which compares favorably to the top result of 0.0976.

1. Introduction

Information retrieval in the genomics literature domain is challenging due to the wide variation of synonymous terms, acronyms, and morphological variants used for identifying the same biological concepts. For example, the terms “bovine spongiform encephalopathy”, “BSE”, and “Mad Cow Disease” are all different terms representing the same named entity or concept. In addition, acronyms frequently have multiple meanings (polysemy) and require contextual clues for accurate disambiguation. Search terms also carry much higher relevance when they match document terms within the local context of a phrase, sentence, or passage of text. An acronym like “IP” could represent “immunoprecipitant” or “ischemic precondition.” In this case, context captured at the paragraph or document level where an acronym is defined can help disambiguate its meaning [2]. Databases from the National Center for Biotechnology Information (NCBI) [3] and other sources can be helpful in providing semantic evidence supporting identification and extraction of named biological entities. However, it is important to recognize that no knowledge source can fully capture the complexities of human language, let alone be fully up-to-date with the dynamic vocabulary of an evolving science.
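To make the “IP” example above concrete, the following is a minimal sketch of how an acronym defined somewhere in a document could be resolved from that document's own text. It is an illustration only, not the extraction technique used in this work: the function name `expand_acronyms`, the regular expression, and the initial-letter matching rule are all assumptions chosen for brevity.

```python
import re

# Assumed pattern: up to six plain words followed by an upper-case acronym in
# parentheses, e.g. "ischemic precondition (IP)".
DEFINITION = re.compile(r"((?:[A-Za-z][A-Za-z-]*\s+){1,6})\(([A-Z]{2,10})\)")


def expand_acronyms(document_text):
    """Return {acronym: long_form} for definitions found in the document.

    Hypothetical helper: a long form is accepted only when the initials of
    the words directly before the parenthesis spell the acronym.
    """
    expansions = {}
    for match in DEFINITION.finditer(document_text):
        words = match.group(1).split()
        acronym = match.group(2)
        # Take as many trailing words as the acronym has letters.
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            # Keep the first definition seen in the document.
            expansions.setdefault(acronym, " ".join(candidate))
    return expansions


if __name__ == "__main__":
    doc = "Rats underwent ischemic precondition (IP) before surgery."
    print(expand_acronyms(doc))
```

Under these assumptions, a query containing “IP” could be matched against the long form recorded for each document rather than against the bare acronym. A single-word long form such as “immunoprecipitant” would not pass the initial-letter test, so a practical system would need a looser alignment rule or an external knowledge source.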
In most cases, the available semantic evidence varies in strength, which can make accurate identification of biological concepts difficult. In these cases, optimal retrieval solutions need to integrate additional sources of evidence, including identification of key phrases and terms within context.

We propose that effective search requires a systematic approach for combining semantic and contextual evidence. Our approach relies on an indexing model that supports search of single and multi-word terms to support identification of concept term variants, search at different levels of document structure for identifying terms and concepts within context, and integration of external knowledge sources to aid in the identification and resolution of named biological entities and related