LITSEEK: Public Health Literature Search by Metadata Enhancement with External Knowledge Bases

Priyanka Prabhu (1), priyanka.prabhu@gatech.edu
Shamkant Navathe (1), sham@cc.gatech.edu
Stephen Tyler (1), st144@mail.gatech.edu

(1) College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0280. Tel: +1-404-385-2892

ABSTRACT
Biomedical literature is an important source of information in any researcher's investigation of genes, risk factors, diseases and drugs. Often the information sought by public health researchers is distributed across multiple disparate sources that may include publications from PubMed; genomic, proteomic and pathway databases; gene expression and clinical resources; and biomedical ontologies. The unstructured nature of this information makes it difficult to find the relevant parts manually, and comprehensive knowledge is even harder to synthesize automatically. In this paper we report on LITSEEK (LITerature Search by metadata Enhancement with External Knowledgebases), a system we have developed for researchers at the Centers for Disease Control (CDC) that enables them to search the HuGE (Human Genome Epidemiology) database of PubMed articles from a pharmacogenomic perspective. Besides analyzing text using TF-IDF ranking and indexing of the important terms, the proposed system automatically consults PharmGKB, a human-curated knowledge base about drugs and their related diseases and genes, as well as the Gene Ontology, a human-curated, well-accepted ontology. We highlight the main components of our approach and illustrate how the search is enhanced by incorporating additional concepts from PharmGKB in the form of genes, drugs and diseases (called metadata for ease of reference). Various measurements are reported with respect to the addition of these metadata terms. Preliminary results in terms of precision, based on expert user feedback from the CDC, are encouraging.
Further evaluation of the search procedure by actual researchers is under way.

Categories and Subject Descriptors
H.2.4 [Database Management Systems]: Textual Databases
H.3.3 [Information Search and Retrieval]: Information Filtering, Search Process, Selection

General Terms
Algorithms, Design.

Keywords
Metadata integration, text mining, information retrieval, pharmacogenomics, knowledge bases, search.

Copyright is held by the author/owner(s). DTMBIO'09, November 6, 2009, Hong Kong, China. ACM 978-1-60558-803-2/09/11.

1. ENHANCEMENT OF BIOMEDICAL LITERATURE SEARCH

1.1 The Problem
Biomedical knowledge search typically requires manual use of external knowledge bases (such as ontologies and databases) in conjunction with bibliographic literature to link and relate knowledge for research purposes. In particular, after viewing articles in systems like PubMed, researchers often need specific information about genes, diseases and drugs and their mutual relationships. For example, a user searching for the gene BRCA1 may not find articles that mention only the alternate names of BRCA1. The user may also be interested in articles about the drugs and diseases associated with the BRCA1 gene, yet these articles do not necessarily contain the term BRCA1. The challenges in this context are as follows:
i) Identification of genes, drugs and diseases in unstructured natural language text.
ii) Linking the entities of interest found in step (i) with external information about them.
iii) Integration of structured (relational) and unstructured (plain text) data from multiple sources.

1.2 The Solution
We propose to solve this problem by integrating the actual data and the specific information of interest from multiple external sources with no manual effort. Our solution is based on metadata enhancement and query expansion, in which an integrated search engine aims to improve the search experience for a researcher.
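As a rough illustration of the query expansion idea, the following minimal Python sketch shows how a query such as BRCA1 might be broadened with alternate names and associated terms before articles are matched. The knowledge-base dictionary, its entries, the article records and the function names are all hypothetical placeholders chosen for illustration; they are not PharmGKB's actual records or LITSEEK's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KnowledgeEntry:
    """Hypothetical knowledge-base record for one query term."""
    aliases: List[str] = field(default_factory=list)   # alternate names
    related: List[str] = field(default_factory=list)   # associated drugs/diseases


# Toy, hand-written knowledge base (an assumption, not real PharmGKB content).
TOY_KB: Dict[str, KnowledgeEntry] = {
    "brca1": KnowledgeEntry(aliases=["RNF53"], related=["breast cancer"]),
}


def expand_query(term: str, kb: Dict[str, KnowledgeEntry] = TOY_KB) -> List[str]:
    """Return the original term followed by any aliases and related terms."""
    term = term.strip()
    terms = [term]
    entry = kb.get(term.lower())
    if entry is not None:
        seen = {term.lower()}
        for extra in entry.aliases + entry.related:
            if extra.lower() not in seen:
                seen.add(extra.lower())
                terms.append(extra)
    return terms


def search(articles: List[dict], term: str,
           kb: Dict[str, KnowledgeEntry] = TOY_KB) -> List[int]:
    """Return ids of articles whose text contains any expanded term."""
    terms = [t.lower() for t in expand_query(term, kb)]
    return [a["id"] for a in articles
            if any(t in a["text"].lower() for t in terms)]


# Hypothetical article records used only to exercise the sketch.
ARTICLES = [
    {"id": 1, "text": "BRCA1 mutations in a population cohort"},
    {"id": 2, "text": "RNF53 expression analysis"},
    {"id": 3, "text": "Breast cancer risk factors"},
    {"id": 4, "text": "An unrelated asthma study"},
]
```

Under these assumptions, a plain substring search for BRCA1 would match only the first toy article, whereas `search(ARTICLES, "BRCA1")` also picks up the articles that mention the alias or the associated disease.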
LITSEEK integrates with pharmacogenetic knowledge bases: it uses the PharmGKB database for query expansion and provides further knowledge from the Gene Ontology [Ashburner et al. 2000]. As the literature database we use the HuGE dataset [Yu et al. 2008], which has been compiled and curated by scientists at the CDC for epidemiological research. The HuGE database currently contains about 20,000 articles. We have developed an automated classifier approach [Polavarapu et al. 2005] to help a human expert select articles for HuGE from PubMed.

The primary features of our search system are as follows:
1. For a given query term, the user can choose to retrieve the primary set of articles based strictly on (i) Titles only or (ii) Titles and Abstracts.