An Adaptive Ontology based Hierarchical Browsing System for CiteSeerX
Nanhong Ye
CSCE Department
University of Arkansas
Fayetteville, USA
Email: nye@uark.edu
Susan Gauch
CSCE Department
University of Arkansas
Fayetteville, USA
Email: sgauch@uark.edu
Qiang Wang
CSCE Department
University of Arkansas
Fayetteville, USA
Email: qxw002@uark.edu
Hiep Luong
CSCE Department
University of Arkansas
Fayetteville, USA
Email: hluong@uark.edu
Abstract—As an indispensable technique in
the field of Information Retrieval, the Ontology based Retrieval
System (or Browsing Hierarchy) has been well studied and
developed in both academia and industry. However, most
current systems suffer from the following problems: (1) Constructing
the mappings between documents and concepts in an ontology
requires training robust hierarchical classifiers; it is
difficult to build such classifiers for a large-scale document
corpus because of time-efficiency and precision issues. (2)
The traditional Hierarchical Browsing System ignores the
distribution of documents over concepts, which is unrealistic
when a large number of documents are concentrated on certain
concepts. Browsing the documents under such concepts becomes
time-consuming and impractical for users. Therefore, further
splitting these concepts into sub-categories is necessary and critical
for organizing documents in the browsing system. Aiming at
building the Hierarchical Browsing System more realistically
and accurately, we propose an adaptive Hierarchical Browsing
System framework in this paper, which is designed to build
a Browsing Hierarchy for CiteSeerX. In this framework, we
first investigate supervised learning approaches for classifying
documents into the existing predefined concepts of an ontology and
compare their performance on different datasets from CiteSeerX.
Then, we give an empirical analysis of unsupervised learning
methods for adding new clusters to the existing browsing
hierarchy. Experimental analysis on the CiteSeerX corpus shows
the effectiveness and efficiency of our method.
Keywords-Ontology; Browsing System; Unsupervised Learning
I. INTRODUCTION
With the exponential growth of information generated on the
World Wide Web, Information Retrieval techniques such as
browsing systems have become more and more important.
Unlike ad hoc Information Retrieval, searching for
information by browsing offers another perspective on
retrieval. Typically, a browsing system
is associated with an ontology, a hierarchical
structure of concepts that represents a domain of
knowledge. In practice, building an ontology for
an intelligent system requires domain experts
to manually identify a set of representational primitives and
integrate them iteratively into the ontology. For instance,
several ontology based search and browse
systems [1,4] were created in this manner and are constructed and
maintained by a vast community of volunteer editors. Growing
and maintaining an ontology is a challenging problem
because ontology engineers must keep themselves
up to date with extensive domain-specific knowledge
and remain compliant with the existing ontology. Furthermore, general
cross-domain ontologies such as the Open Directory Project [4]
and Wikipedia are difficult to keep logically consistent because
different groups of domain experts conceptualize knowledge
with heterogeneous structures.
Another issue in ontology engineering is the set of techniques
for automatically mapping
documents to concepts. An ontology, in essence, is concerned
with the classification and categorization of real objects,
not only with the concepts themselves. Today, with the
exponential growth of the information available on the World
Wide Web, ontology engineers have difficulty meeting
the efficiency and effectiveness demanded by users
searching for relevant information under specific concepts.
For example, network protocols (C.2.2), a research category
in the ACM Computing Classification System [1], has been
extensively studied and has given rise to more than 70 protocols
in different layers of the OSI model. As this category keeps
growing bigger and bigger, finding relevant documents related
to certain topics becomes laborious and time-consuming
work for users.
To overcome these difficulties, we propose in this paper
a new framework for building an Ontology based Browsing
and Search System. First, by integrating our previous work,
KeyConcept [2], we construct the mappings between documents
and the existing ontology. Then we investigate several
unsupervised clustering methods and use them to further
split heavily burdened categories. The experimental results
on the CiteSeerX corpus show that our method scales to
large document collections and provides a more robust way
to construct a browsing system. The remainder of this paper
is organized as follows. Section 2 provides an overview
of several major approaches to ontology extension and
some related work. Section 3 presents the algorithms of our
model in detail. The experimental results and evaluation
are presented in Section 4, followed by the conclusion and
future work in Section 5.
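
The two-step framework outlined above (map documents to existing concepts, then split heavily burdened concepts by clustering) can be sketched roughly as follows. This is an illustrative sketch only, not the implementation used in this paper: a cosine nearest-profile assignment stands in for the KeyConcept classifier, a simple 2-means stands in for the clustering methods under study, and every function name, the max_docs threshold, and the idea of keyword-based concept profiles are hypothetical.

```python
from collections import Counter
import math


def tf_vector(text):
    """Bag-of-words term-frequency vector for whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Unnormalized sum of vectors; cosine similarity ignores the scale."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def assign_to_concepts(docs, concept_profiles):
    """Step 1 (stand-in classifier): put each document under its most similar concept."""
    buckets = {c: [] for c in concept_profiles}
    for d in docs:
        v = tf_vector(d)
        best = max(concept_profiles, key=lambda c: cosine(v, concept_profiles[c]))
        buckets[best].append(d)
    return buckets


def two_means(docs, iterations=10):
    """Step 2 (stand-in clustering): split one concept's documents into two groups."""
    vecs = [tf_vector(d) for d in docs]
    # Deterministic seeding: the first document and the one least similar to it.
    far = min(range(len(vecs)), key=lambda i: cosine(vecs[0], vecs[i]))
    centers = [vecs[0], vecs[far]]
    labels = [0] * len(vecs)
    for _ in range(iterations):
        labels = [0 if cosine(v, centers[0]) >= cosine(v, centers[1]) else 1
                  for v in vecs]
        for k in (0, 1):
            members = [v for v, lab in zip(vecs, labels) if lab == k]
            if members:
                centers[k] = centroid(members)
    return [[d for d, lab in zip(docs, labels) if lab == k] for k in (0, 1)]


def split_overloaded(buckets, max_docs):
    """Replace every concept holding more than max_docs documents by sub-clusters."""
    result = {}
    for concept, docs in buckets.items():
        if len(docs) > max_docs:
            for i, group in enumerate(two_means(docs)):
                if group:
                    result[f"{concept}/cluster{i}"] = group
        else:
            result[concept] = docs
    return result
```

In this sketch, a caller would build concept_profiles from a few descriptive terms per concept, call assign_to_concepts on the corpus, and then call split_overloaded with a chosen threshold. Only concepts that exceed the threshold are split, so lightly populated parts of the hierarchy are left unchanged.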
2010 Second International Conference on Knowledge and Systems Engineering
978-0-7695-4213-3/10 $25.00 © 2010 IEEE
DOI 10.1109/KSE.2010.32