American Journal of Applied Sciences 6 (6): 1217-1224, 2009
ISSN 1546-9239
© 2009 Science Publications
Corresponding Author: Ahmad Adel Abu Shareha, School of Computer Sciences, Universiti Sains Malaysia, 11800, Penang,
Malaysia Tel: +604-6533888 Fax: +604-6573335
Multimodal Integration (Image and Text) Using Ontology Alignment
Ahmad Adel Abu Shareha, Mandava Rajeswari and Dhanesh Ramachandram
School of Computer Sciences, Universiti Sains Malaysia, 11800 Penang, Malaysia
Abstract: Problem statement: This study proposed a multimodal integration method at the concept
level to integrate information from multiple modalities. The multimodal data were represented as two
separate lists of concepts extracted from images and their related text. The concepts extracted from
image analysis are often ambiguous, while the concepts extracted from text processing can be
sense-ambiguous. The major problems facing the integration of the underlying modalities (image and
text) were the difference in coverage and the difference in granularity level. Approach: This study
proposed a novel application of ontology alignment to unify the underlying ontologies. The two lists
of concepts were represented in a structured form within the corresponding ontologies; the two
structured lists were then enriched and matched based on the alignment, and this matching represented
the final knowledge. Results: The difference in coverage was resolved using the alignment process and
the difference in granularity level was resolved using the enrichment process. Thus, the proposed
integration produced accurate integrated results. Conclusion: Integration of these concepts allows
the totality of the knowledge to be expressed more precisely.
Key words: Concept-level multimodal integration, ontology alignment and semantic knowledge
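To make the proposed pipeline concrete, the following is a minimal Python sketch assuming a toy is-a ontology; the names TOY_ONTOLOGY, enrich and integrate are illustrative only, and the paper's actual ontologies and alignment procedure are considerably richer than this illustration:

# Minimal sketch of concept-level integration, assuming a toy is-a ontology.
# TOY_ONTOLOGY, enrich and integrate are illustrative names, not from the paper.
TOY_ONTOLOGY = {            # child -> parent (is-a) links
    "tiger": "feline",
    "cat": "feline",
    "feline": "animal",
    "tree": "plant",
}

def enrich(concepts, ontology):
    """Expand each concept with its ancestors to bridge granularity gaps."""
    enriched = set()
    for c in concepts:
        while c is not None:
            enriched.add(c)
            c = ontology.get(c)
    return enriched

def integrate(image_concepts, text_concepts, ontology):
    """Match the two enriched concept sets; their overlap is the shared knowledge."""
    return enrich(image_concepts, ontology) & enrich(text_concepts, ontology)

# A coarse image concept ("feline") and a fine-grained text concept ("tiger")
# meet at their shared ancestors after enrichment.
print(integrate({"feline"}, {"tiger"}, TOY_ONTOLOGY))   # {'feline', 'animal'}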
INTRODUCTION
Multimodal fusion and multimodal integration
refer to the merging of different sources of information
as a means of enhancing the outcome of some specific
task. The richness of the information provided by
multimodal data could potentially lead to better
performance than that of tasks relying on unimodal
data. In the natural world, humans and animals on the
higher rungs of evolution perceive the world using
multiple senses concurrently and use their acquired
knowledge to analyze and understand events. Here,
multimodal information integration takes place at a
high level using predetermined knowledge[1,2]. Data
used in multimodal integration can be present at
different levels of abstraction.
There have been several multimodal integration
approaches reported in the literature that vary in their
context and application. Generally, we may categorize
these approaches into (a): Multimodal Fusion
Approaches, (b): Tightly-coupled Multimodal
Integration and (c): Augmented Unimodal Analysis. In
multimodal fusion, low-level integration of multimodal
data is the main characteristic of the approach: rich data
from a single source is divided into multiple modalities
for efficient processing and finally combined for better
interpretation[3,4]. Tightly-coupled multimodal
integration involves data from multiple sources that are
tightly coupled (e.g., the movements of the lips and the
corresponding words being spoken), which are processed
independently and integrated at a high level of
abstraction. In this example, both the image and the text
express the same information at any given time[5,6]. The
multimodal integration is then performed at a higher
level of data abstraction. In what we classify as
augmented unimodal analysis, the extraction of
knowledge is primarily based on a dominant modality.
However, to aid the analysis and interpretation of the
subject matter of interest, associated data from a
different modality may be used. Here, the assisting
modality's knowledge is used without any pre-processing;
hence, the data from the dominant modality has to be
processed and transformed into a form suitable for use
with the assisting modality. For example, in the research
of Benitez and Chang[7], perceptual knowledge is
extracted from an image and then disambiguated with the
assistance of the associated keywords. Here, the main
focus is to disambiguate image content using textual
data (keywords).
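To illustrate this kind of keyword-assisted disambiguation, the sketch below scores the WordNet senses of an ambiguous image label against the accompanying keywords using NLTK; this is only an assumed illustration, not the actual algorithm of Benitez and Chang[7], and the function disambiguate is hypothetical:

# Hedged illustration: resolve an ambiguous image label with text keywords
# via WordNet path similarity (NLTK). Not the method of Benitez and Chang [7].
from nltk.corpus import wordnet as wn

def disambiguate(image_label, keywords):
    """Return the noun sense of image_label most similar to the keywords."""
    best_sense, best_score = None, 0.0
    for sense in wn.synsets(image_label, pos=wn.NOUN):
        # Score a candidate sense by its best similarity to any keyword sense.
        score = max(
            (sense.path_similarity(ks) or 0.0
             for kw in keywords
             for ks in wn.synsets(kw, pos=wn.NOUN)),
            default=0.0,
        )
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# The visual label "bank" is resolved differently by different text contexts.
print(disambiguate("bank", ["river", "water"]))   # likely the sloping-land sense
print(disambiguate("bank", ["money", "loan"]))    # likely the financial sense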
In the context of our research, the multimodal
information consists of images and the accompanying
textual descriptions of the images or any free text