Domain-Specific Term Extraction for Concept Identification in Ontology
Construction
Kiruparan Balachandran, Surangika Ranathunga
Department of Computer Science and Engineering
University of Moratuwa
Sri Lanka
kiruparan@gmail.com, surangika@cse.mrt.ac.lk
Abstract—An ontology is a formal and explicit specification
of a shared conceptualization. Manual construction of a domain
ontology does not adequately satisfy the requirements of new
applications, which need a more dynamic ontology and the
ability to manage a quantity of concepts too large for humans
to handle alone. Researchers have discussed
ontology learning as a solution to overcome issues related to
the manual construction of ontologies. Ontology learning is
an automatic or semi-automatic process that applies
methods for building an ontology from scratch, or for enriching
or adapting an existing ontology. This research focuses on
improving the process of term extraction for identifying
concepts in ontology learning. Available approaches to term
extraction are limited in several ways. These
limitations include: (1) obtaining domain-specific terms from a
domain expert as seed words without automatically
discovering them from the corpus, and (2) unsuitable usage of
corpora in discovering domain-specific terms for multiple
domains. Our study uses linguistic analysis and statistical
calculations to extract domain-specific simple and complex
terms to overcome the first limitation. To address the second
limitation, we use multiple contrastive corpora, which reduces
the bias introduced by a single contrastive corpus. Evaluations show
that our system is better at extracting terms when compared
with the previous research that used the same corpora.
Keywords - term extraction; ontology learning; non-taxonomic
relations; taxonomic relations; concept
I. INTRODUCTION
An ontology is a formal and explicit specification of a
shared conceptualization. An ontology should be readable and
understandable by a software agent, and the constructed
ontology should be verified and accepted by relevant
domain experts and the community. One objective of an
ontology is to eliminate conceptual and terminological
confusion in a specific community. An ontology consists of
a set of concepts, a set of relations, a set of rules, and instances
of concepts (also referred to as terms). Terms can be simple
or complex. For example, consider the two terms “randomized
algorithm” and “program” in the Computer Science domain.
Here, “randomized algorithm” is a complex term because it
consists of more than one word, whereas “program” is a
simple term.
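The distinction between simple and complex terms can be sketched in a few lines of Python. This is only an illustration: the function name term_kind is hypothetical, and it assumes that words are separated by whitespace, which is a simplification of real linguistic analysis.

```python
def term_kind(term: str) -> str:
    """Classify a term as 'simple' (one word) or 'complex' (several words).

    Assumption for illustration: words are delimited by whitespace.
    """
    return "complex" if len(term.split()) > 1 else "simple"


examples = ["randomized algorithm", "program"]
kinds = {term: term_kind(term) for term in examples}
```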
A manually constructed ontology depends wholly on the
domain expert’s viewpoints, assumptions, and needs
regarding that domain. Because these three factors differ
from one domain expert to another, manual construction leads to
inconsistent ontologies. Researchers have discussed ontology
learning as a solution to overcome issues related to the
manual construction of ontologies. Ontology learning is
an automatic or semi-automatic process that applies
methods for building an ontology from scratch, or for enriching
or adapting an existing ontology.
According to Buitelaar et al. [1], the ontology learning
process can be organized into a layer cake with the
following modules: (1) Extracting domain-specific terms,
(2) Finding the synonyms for identified terms, (3)
Discovering concepts, (4) Extracting taxonomic
relationships, (5) Extracting non-taxonomic relationships,
and (6) Extracting rules from text to validate the discovered
ontology. Discovering a valid ontology from a corpus for a
given domain requires improving each step in the
ontology learning process.
In this study, we focus on improving the domain-specific
term extraction process. Available approaches to term
extraction are limited in several ways. Most
existing approaches assume that a domain expert feeds
domain-specific terms to the ontology learning process [2].
Further, the automated approaches that use an available
corpus to extract domain-specific terms are not efficient.
Previous research has defined two types of corpora [3], [4],
[5]:
• Target domain corpus: the corpus from which the
domain-specific terms are extracted. The corpus is
dedicated to one domain.
• Contrastive corpus: a single corpus created by
combining the corpora of other domains, excluding
the target domain.
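As a rough illustration of how a target domain corpus and a single contrastive corpus might be compared, the following Python sketch scores each term by the ratio of its relative frequency in the two corpora. The scoring function, the smoothing constant, and the toy corpora are assumptions made for illustration; they are not the measure used in this study or in [3], [4], [5].

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each term to its share of the corpus (count / total tokens)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def domain_specificity(target_tokens, contrastive_tokens, smoothing=1e-6):
    """Hypothetical score: relative frequency in the target corpus divided
    by relative frequency in the contrastive corpus. The smoothing term
    avoids division by zero for terms absent from the contrastive corpus."""
    target = relative_frequencies(target_tokens)
    contrast = relative_frequencies(contrastive_tokens)
    return {
        term: freq / (contrast.get(term, 0.0) + smoothing)
        for term, freq in target.items()
    }


# Toy corpora (invented for this sketch).
target = "algorithm program algorithm data program algorithm".split()
medical = "patient data clinic patient".split()
finance = "market data price program".split()

# A single contrastive corpus built by combining the other domains.
contrastive = medical + finance

scores = domain_specificity(target, contrastive)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy data, a score above 1 means the term is relatively more frequent in the target corpus than in the combined contrastive corpus. Because the other domains are merged into one corpus, the score cannot tell which of those domains a shared term comes from.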
In existing approaches, a term is considered
domain-specific if it has more influence on the
target domain corpus than on the contrastive corpus. However,
to extract domain-specific terms, we need to identify terms
that are unique to each domain. To do so, the influence of each term on
2016 IEEE/WIC/ACM International Conference on Web Intelligence
978-1-5090-4470-2/16 $31.00 © 2016 IEEE
DOI 10.1109/WI.2016.16