American Journal of Applied Sciences 6 (6): 1217-1224, 2009
ISSN 1546-9239
© 2009 Science Publications
Corresponding Author: Ahmad Adel Abu Shareha, School of Computer Sciences, Universiti Sains Malaysia, 11800, Penang,
Malaysia Tel: +604-6533888 Fax: +604-6573335
Multimodal Integration (Image and Text) Using Ontology Alignment
Ahmad Adel Abu Shareha, Mandava Rajeswari and Dhanesh Ramachandram
School of Computer Sciences, Universiti Sains Malaysia, 11800 Penang, Malaysia
Abstract: Problem statement: This study proposed a multimodal integration method at the concept
level to integrate information from multiple modalities. The multimodal data were represented as two
separate lists of concepts extracted from images and their related text. The concepts extracted from
image analysis are often ambiguous, while the concepts extracted from text processing can be
sense-ambiguous. The major problems facing the integration of the underlying modalities (image and
text) were the difference in coverage and the difference in granularity level. Approach: This study
proposed a novel application of ontology alignment to unify the underlying ontologies. The two lists
of concepts were represented in a structured form within the corresponding ontologies; the two
structured lists were then enriched and matched based on the alignment, and this matching represented
the final knowledge. Results: The difference in coverage was resolved using the alignment process and
the difference in granularity level was resolved using the enrichment process. Thus, the proposed
integration produced accurate integrated results. Conclusion: Integration of these concepts allows
the totality of the knowledge to be expressed more precisely.
Key words: Concept-level multimodal integration, ontology alignment and semantic knowledge
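To make the proposed pipeline concrete, the following is a minimal Python sketch assuming a toy is-a ontology; the names TOY_ONTOLOGY, enrich and integrate are illustrative only, and the paper's actual ontologies and alignment procedure are considerably richer than this illustration:

# Minimal sketch of concept-level integration, assuming a toy is-a ontology.
# TOY_ONTOLOGY, enrich and integrate are illustrative names, not from the paper.
TOY_ONTOLOGY = {            # child -> parent (is-a) links
    "tiger": "feline",
    "cat": "feline",
    "feline": "animal",
    "tree": "plant",
}

def enrich(concepts, ontology):
    """Expand each concept with its ancestors to bridge granularity gaps."""
    enriched = set()
    for c in concepts:
        while c is not None:
            enriched.add(c)
            c = ontology.get(c)
    return enriched

def integrate(image_concepts, text_concepts, ontology):
    """Match the two enriched concept sets; their overlap is the shared knowledge."""
    return enrich(image_concepts, ontology) & enrich(text_concepts, ontology)

# A coarse image concept ("feline") and a fine-grained text concept ("tiger")
# meet at their shared ancestors after enrichment.
print(integrate({"feline"}, {"tiger"}, TOY_ONTOLOGY))   # {'feline', 'animal'}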
INTRODUCTION
Multimodal fusion and multimodal integration
refer to the merging of different sources of information
as a means of enhancing the outcome of some specific
task. The richness of the information provided by
multimodal data could potentially lead to better
performance than that of tasks relying on unimodal
data. In the natural world, humans and animals on the
higher rungs of evolution perceive the world using
multiple senses concurrently and use their acquired
knowledge to analyze and understand events. Here,
multimodal information integration takes place at a
high level using predetermined knowledge[1,2]. Data
used in multimodal integration can be present at
different levels of abstraction.
There have been several multimodal integration
approaches reported in the literature that vary in their
context and application. Generally, we may categorize
these approaches into (a): Multimodal Fusion
Approaches, (b): Tightly-coupled Multimodal
Integration and (c): Augmented Unimodal Analysis. In
multimodal fusion, low-level integration of multimodal
data is the main characteristic of the approach: rich data
from a single source is divided into multiple modalities
for efficient processing and finally combined for better
interpretation[3,4]. Tightly-coupled multimodal
integration involves data from multiple sources that are
tightly coupled (e.g., the movements of the lips and the
corresponding words being spoken), which are processed
independently and integrated at a high level of
abstraction. In this example, both the image and the text
express the same information at any given time[5,6]. The
multimodal integration is then performed at a higher
level of data abstraction. In what we classify as
augmented unimodal analysis, the extraction of
knowledge is primarily based on a dominant modality.
However, to aid the analysis and interpretation of the
subject matter of interest, associated data from a
different modality may be used. Here, the assisting
modality's knowledge is used without any pre-processing;
hence, the data from the dominant modality has to be
processed and transformed into a form suitable for use
with the assisting modality. For example, in the research
of Benitez and Chang[7], perceptual knowledge is
extracted from an image and then disambiguated with the
assistance of the associated keywords. Here, the main
focus is to disambiguate image content using textual
data (keywords).
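To illustrate this kind of keyword-assisted disambiguation, the sketch below scores the WordNet senses of an ambiguous image label against the accompanying keywords using NLTK; this is only an assumed illustration, not the actual algorithm of Benitez and Chang[7], and the function disambiguate is hypothetical:

# Hedged illustration: resolve an ambiguous image label with text keywords
# via WordNet path similarity (NLTK). Not the method of Benitez and Chang [7].
from nltk.corpus import wordnet as wn

def disambiguate(image_label, keywords):
    """Return the noun sense of image_label most similar to the keywords."""
    best_sense, best_score = None, 0.0
    for sense in wn.synsets(image_label, pos=wn.NOUN):
        # Score a candidate sense by its best similarity to any keyword sense.
        score = max(
            (sense.path_similarity(ks) or 0.0
             for kw in keywords
             for ks in wn.synsets(kw, pos=wn.NOUN)),
            default=0.0,
        )
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# The visual label "bank" is resolved differently by different text contexts.
print(disambiguate("bank", ["river", "water"]))   # likely the sloping-land sense
print(disambiguate("bank", ["money", "loan"]))    # likely the financial sense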
In the context of our research, the multimodal
information consists of images and the accompanying
textual descriptions of the images or any free text