Generating GO Slim Using Relational Database Management Systems to Support Proteomics Analysis

Getiria Onsongo 1, Hongwei Xie 2, Timothy J. Griffin 2, John Carlis 1

[1] Dept of Computer Science and Engineering, University of Minnesota, 200 Union St SE, Minneapolis, MN 55455.
[2] Dept of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, 321 Church St., Minneapolis, MN 55455

Abstract

The Gene Ontology Consortium built the Gene Ontology database (GO) to address the need for a common standard in naming genes and gene products. Using different names for the same concept, or the same name for different concepts, makes it effectively impossible for humans and computers alike to analyze biological processes across different organisms. The consortium addresses this need by defining terms for categorizing genes and gene products. A convention in GO is that each gene or gene product is annotated to the most specific applicable term in the GO database. However, when analyzing the results of an experiment, it is also useful for researchers to group genes or gene products into broad biological categories that give a higher-level view of their function. A GO Slim is a subset of the GO ontology that provides such a higher-level view of functions. Existing GO Slim generation tools have two important limitations: programming language dependence, and an inability to dynamically generate a GO Slim during analysis. We have extended the relational database engine to dynamically generate a GO Slim, overcoming these limitations. Using this extension, we have developed a tool (DynamicGOSlim) that dynamically generates a GO Slim and uses the generated GO Slim to categorize genes or gene products. This tool is being used in an ongoing proteomics project aimed at identifying possible oral cancer biomarkers in saliva.

1. Introduction

The discovery-driven nature of biological science naturally leads scientists to name what they find.
However, giving different names to what turns out to be the same concept, and giving different concepts the same name, impedes science, making it effectively impossible for humans and computers alike to analyze biological concepts within and especially across different organisms. The Gene Ontology Consortium was formed to help reduce this babel, specifically to “produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing” [1]. The consortium has successfully encouraged the disciplined use of a common language by establishing, by consensus, a restricted vocabulary, making it publicly available in the Gene Ontology database, GO, and providing mechanisms for its periodic update. It is now common for GO terms to be used in the research literature and public databases [2], [5].

GO consists of genes and gene products plus certain concepts, called terms, associated with them, and, in addition, other data that is not relevant here. GO organizes terms and the parent-child relationships between terms into three separate ontologies: biological processes, molecular functions and cellular components. Each ontology forms a directed acyclic graph, DAG, with each node being a term and each parent-child relationship being a directed arc between distinct nodes. In GO each child term is a more specific process, function or component than each of its parent terms. An association connects a gene or gene product with the most specific possible term, and implicitly applies to the term's ancestors. Collectively, the genes and gene products associated with a term are called its annotation. Figure 1 shows a small portion of GO, with terms appearing inside rectangles, genes or gene products associated with a term appearing inside ellipses attached to its rectangle, and parent-child term relationships appearing as arrows.

1.1.
GO Slim

Tools to produce variants of GO called GO Slims were developed because, for some tasks, such as analyzing the results of an experiment, two characteristics of GO make it less than ideal. First, users may be interested in only a small portion of the entire database and masses of irrelevant

21st IEEE International Symposium on Computer-Based Medical Systems 1063-7125/08 $25.00 © 2008 IEEE DOI 10.1109/CBMS.2008.77 215