Auto-Extraction, Representation and Integration of a Diabetes Ontology using Bayesian Networks

Ken McGarry∗†, Sheila Garfield and Stefan Wermter
School of Computing and Technology, University of Sunderland, UK
School of Pharmacy, University of Sunderland, UK
{ken.mcgarry,sheila.garfield,stefan.wermter}@sunderland.ac.uk

Abstract

This paper describes how high-level biological knowledge obtained from ontologies such as the Gene Ontology (GO) can be integrated with low-level information extracted from a Bayesian network trained on protein interaction data. We automatically generate a biological ontology by text mining the type II diabetes research literature. The ontology is populated with the entities and relationships from protein-to-protein interactions. New, previously unrelated information is extracted from the growing body of research literature and combined with existing knowledge on this subject from the Gene Ontology and databases such as BIND and BioGRID. We integrate the ontology within the probabilistic framework of Bayesian networks, which enables reasoning about and prediction of protein function.

1 Introduction

The large amounts of genomic and proteomic data generated by biological experiments are now enabling deeper insights into cellular and molecular function. New technologies such as microarrays and electrophoresis gels are providing vast quantities of experimental data at unprecedented rates. All of this information needs to be stored and carefully annotated. With each new experiment providing details of new protein-to-protein interactions, new biological pathways and new genes, it is essential that these discoveries are made available to the scientific community. To this end, online scientific databases are now in place that disseminate these results. These databases, such as the popular Gene Ontology (GO), are updated at intervals to reflect the latest developments [1].
The updating is done by experts who manually revise each entry by reading the research literature and annotating the database collections accordingly. Unfortunately, hand annotation is a slow process and the databases are lagging behind the experimental work by a considerable margin.

Our research area is diabetes, in particular the effects of insulin resistance on protein expression and insulin-regulated protein trafficking in fat cells. In recent years there has been a dramatic worldwide increase in the number of people with diabetes. In the year 2000 there were 171 million cases, and the World Health Organization (WHO) has predicted that by 2030 there will be 366 million people suffering from this condition (www.who.int/diabetes/facts/). The WHO data cover diagnosed cases only; the WHO estimates the undiagnosed cases at 14.6 million for the US alone.

In this paper we present our results on how we automatically generate a viable ontology based on information extraction of keywords from the research literature. The keywords define the entities and relationships of important genes, gene relationships, protein-to-protein interactions operate