Enabling Enrichment Analysis with the Human Disease Ontology

Paea LePendu, PhD*, Mark A. Musen, MD, PhD, and Nigam H. Shah, MBBS, PhD

Stanford Center for Biomedical Informatics Research, 251 Campus Drive, Medical School Office Building, Room X215, Mail Code 5479, Stanford University, Stanford, CA 94305-5479, USA

* Corresponding author: plependu@stanford.edu, 650-721-5821, fax 650-725-7944.

NIH Public Access Author Manuscript. J Biomed Inform. Author manuscript; available in PMC 2012 December 01. Published in final edited form as: J Biomed Inform. 2011 December; 44(Suppl 1): S31–S38. doi:10.1016/j.jbi.2011.04.007. © 2011 Elsevier Inc. All rights reserved.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Conflicts of Interest: The authors declare that there are no conflicts of interest.

Abstract

Advanced statistical methods used to analyze high-throughput data such as gene-expression assays result in long lists of “significant genes.” One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene-set, and is widely used to make sense of the results of high-throughput experiments. Our goal is to develop and apply general enrichment analysis methods to profile other sets of interest, such as patient cohorts from the electronic medical record, using a variety of ontologies including SNOMED CT, MedDRA, RxNorm, and others. Although it is possible to perform enrichment analysis using ontologies other than the GO, a key prerequisite is the availability of a background set of annotations to enable the enrichment calculation. In the case of the GO, this background set is provided by the Gene Ontology Annotations. In the current work, we describe: (i) a general method that uses hand-curated GO annotations as a starting point for creating background datasets for enrichment analysis using other ontologies; and (ii) a gene–disease background annotation set, which enables disease-based enrichment, to demonstrate feasibility of our method.

Keywords

Enrichment Analysis; Human Disease; Ontology; Annotation; Information Integration

1. Introduction

One way to gain insight into the significance of a particular set of genes is to determine whether functional terms that are associated with each gene are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene-set, and is widely used to make sense of the results of high-throughput experiments such as gene-expression assays. The canonical example of enrichment analysis is in the interpretation of a list of differentially expressed genes in some condition. The usual approach is to perform enrichment analysis with the Gene Ontology (GO). We can aggregate the annotating GO concepts associated with a particular biological process,