Domain-Specific Term Extraction for Concept Identification in Ontology Construction

Kiruparan Balachandran, Surangika Ranathunga
Department of Computer Science and Engineering
University of Moratuwa
Sri Lanka
kiruparan@gmail.com, surangika@cse.mrt.ac.lk

Abstract - An ontology is a formal and explicit specification of a shared conceptualization. Manual construction of a domain ontology does not adequately satisfy the requirements of new applications, because they need a more dynamic ontology and the ability to manage a quantity of concepts that humans cannot handle alone. Researchers have discussed ontology learning as a solution to overcome issues related to the manual construction of ontologies. Ontology learning is either an automatic or a semi-automatic process that applies methods for building an ontology from scratch, or for enriching or adapting an existing ontology. This research focuses on improving the process of term extraction for identifying concepts in ontology learning. Available approaches to term extraction are limited in various ways. These limitations include: (1) obtaining domain-specific terms from a domain expert as seed words instead of automatically discovering them from the corpus, and (2) unsuitable usage of corpora in discovering domain-specific terms for multiple domains. To overcome the first limitation, our study uses linguistic analysis and statistical calculations to extract domain-specific simple and complex terms. To eliminate the second limitation, we use multiple contrastive corpora, which reduces the bias of using a single contrastive corpus. Evaluations show that our system is better at extracting terms than previous research that used the same corpora.

Keywords - term extraction; ontology learning; non-taxonomic relations; taxonomic relations; concept

I. INTRODUCTION

An ontology is a formal and explicit specification of a shared conceptualization.
An ontology should be readable and understandable by a software agent, and the constructed ontology should be verified and accepted by the relevant domain experts and the community. An objective of an ontology is to eliminate conceptual and terminological confusion in a specific community. An ontology consists of a set of concepts, a set of relations, a set of rules, and instances of concepts (also referred to as terms). Terms can be simple or complex. For example, consider the two terms “randomized algorithm” and “program” in the Computer Science domain. Here, “randomized algorithm” is a complex term because it contains more than one word, whereas “program” is a simple term.

The output of a manually constructed ontology depends wholly on the domain expert’s viewpoints, assumptions, and needs regarding that domain. Since these three factors differ from one domain expert to another, manual construction leads to inconsistent ontologies. Researchers have discussed ontology learning as a solution to overcome issues related to the manual construction of ontologies. Ontology learning is either an automatic or a semi-automatic process that applies methods for building an ontology from scratch, or for enriching or adapting an existing ontology.

According to Buitelaar et al. [1], the ontology learning process can be organized into a layer cake with the following modules: (1) extracting domain-specific terms, (2) finding the synonyms for identified terms, (3) discovering concepts, (4) extracting taxonomic relationships, (5) extracting non-taxonomic relationships, and (6) extracting rules from text to validate the discovered ontology. Discovering a valid ontology from a corpus for a given domain requires improving each step in the ontology learning process.

In this study, we focus on improving the domain-specific term extraction process. Available approaches to term extraction are limited in various ways.
Most existing approaches assume that the domain expert feeds domain-specific terms to the ontology learning process [2]. Further, the automated approaches that use an available corpus to extract domain-specific terms are not efficient. Previous research has defined two types of corpora [3], [4], [5]:

• Target domain corpus: the corpus from which the domain-specific terms are extracted. The corpus is dedicated to one domain.
• Contrastive corpus: a single corpus created by combining the corpora of other domains, excluding the target domain.

2016 IEEE/WIC/ACM International Conference on Web Intelligence 978-1-5090-4470-2/16 $31.00 © 2016 IEEE DOI 10.1109/WI.2016.16

In existing approaches, a term is considered domain-specific if it has more influence on the target domain than on the contrastive corpus. However, to extract domain-specific terms, we need to identify the terms that are unique to each domain. To do so, the influence of each term on