A method for semi-automatic extension of a middle-layer ontology

Ulf Schwarz¹*, Holger Stenzhorn², Nikolina Koleva¹, Luc Schneider¹ and Emilio M. Sanfilippo¹

¹ Institute for Formal Ontology and Medical Information Science, Saarland University, Germany
² Saarland University Hospital, Dep. for Pediatric Oncology and Hematology, Homburg, Germany

ABSTRACT
We present a semi-automatic method for integrating semantic concepts under a middle-layer ontology in the biomedical domain. For specific purposes, a user might want to extend this ontology with concepts or classes necessary for the task at hand, so the middle-layer ontology must be specialized in simple dedicated modules. Our strategy for performing the extension is first to search for candidate concepts in existing sources. The retrieved candidate concepts are then ranked, both on the ontology level and on the concept level. We look for a class from the middle-layer ontology that matches a superclass of a candidate concept. If there is such a match, we generate a recommendation for the integration of the candidate concept. We subscribe to the MIREOT¹ principles and apply them in our strategy. We developed an Ontology Aggregator Tool (OAT) that implements our strategy. The tool allows for the re-use of (parts of) pre-existing resources and enables a user to build a custom-made semantic resource.

1 INTRODUCTION
Users and designers of biomedical ontologies are currently dealing with a proliferation of heterogeneous semantic resources. The plethora of ontologies contained in libraries such as NCBO’s BioPortal (Musen et al. (2012), Whetzel et al. (2011)) illustrates this issue, with more than five million terms gathered and grouped in more than three hundred often overlapping, yet mostly unrelated resources. A natural question that arises at this point is how those heterogeneous classes are related.
Our attempt to answer this question is the development of the Health Data Ontology Trunk² (HDOT). HDOT integrates parts of semantic resources for a larger domain (in our case: the overall biomedical domain) under a middle-layer ontology. It provides a set of separate, extendable modules. The modules represent distinct parts of the envisaged domain in a way that optimally balances expressivity and scalability. At the same time, it ensures ontological consistency in the process of extending the umbrella toward more specialized classes. Since HDOT is a middle-layer ontology, it does not contain specific concepts. However, such concepts might be of interest to its users, and this is where the OAT comes into play. OAT is used for the semi-automatic extension of HDOT, whereby parts of pre-existing semantic resources are re-used. We aim at integrating not only previously established and standardized terms but also well-established class identifiers, i.e. URIs, whenever possible. The OAT is implemented in Java and is thus a platform-independent tool.

* To whom correspondence should be addressed: Ulf Schwarz ulf.schwarz@ifomis.uni-saarland.de
¹ http://obi-ontology.org/page/MIREOT
² https://code.google.com/p/hdot/

2 THE ALGORITHM
2.1 Sorting the resources
There are many portals and a huge number of biomedical ontologies. For the first phase of the development, we decided to use NCBO BioPortal to search for appropriate candidates. We specify a list of prima facie suitable ontologies to be the source of classes proposed for integration under HDOT. In order to be able to consider new ontologies that could be included in the BioPortal repository, we define criteria for sorting all available resources in BioPortal. Consequently, the search results are sorted with respect to the semantic resource they are retrieved from. In addition, we apply eight more criteria for comparison on the ontology level, namely: 1. contained in a pre-defined list of acceptable ontologies; 2.
contained in the OBO Foundry³; 3. release date after 2008-12-31; 4. the resource is not flat; 5. the resource is not only metadata; 6. the author of the classes is specified; 7. the classes are documented; 8. depth of the hierarchy; 9. no classes with one subclass. This is a way to ensure that the quality of HDOT will not be diluted. In further stages of the development, we would like to add more criteria to refine the sorting of the resources.

2.2 Finding candidate classes for integration
As the backbone of our tool we use OntoCat (Adamusiak et al. (2011)). OntoCat provides a programming interface with a high level of abstraction and can be used to query public ontology repositories via REST web services. As soon as OAT is called, the search in BioPortal starts. The search primarily, but not exclusively, uses the term given by the user and delivers a list of hits. By hit we mean a search result, i.e. a class whose label or URI matches the searched term. The received results are then restricted with respect to the similarity⁴ of the query and the searched term.

³ http://www.obofoundry.org/
⁴ We experimented with different thresholds for the similarity and observed that 90% gives good filtering. The similarity measure used is the Levenshtein distance.

2.3 Root path extraction
The next step is the extraction of the root path(s), i.e. the path(s) from the found class to the root class in a given ontology. We extract the root path(s) of the n-best hits contained in the restricted list of