Content-based Multimedia Data Retrieval on Heterogeneous System Environment

Sanan Srakaew, Nikitas A. Alexandridis, Punpiti Piamsa-nga, George Blankenship
Department of Electrical Engineering and Computer Science, George Washington University, Washington D.C. 20052.
{srakaew, alexan, punpiti, blankeng}@seas.gwu.edu

Abstract

In this paper, we propose a static data partitioning scheme for content-based multimedia data retrieval on a heterogeneous cluster system. Multimedia data is represented by the unified k-tree structure of k-dimensional (k-d) signals proposed in [6]. Each dimension of the k-d data is divided into small blocks, which are then organized into a hierarchical multidimensional tree structure called a k-tree. A parallel version of the k-tree model was introduced in [7]. Previous experimental results showed a large reduction in retrieval time on a cluster of homogeneous workstations. In this paper, we extend our parallel model to a heterogeneous cluster environment by taking into account system characteristics such as computational time, input/output time, available storage, and communication latency. Experiments with the load-balanced model show a significant reduction in retrieval time while maintaining the quality of perceptual results.

Keywords: Data partitioning, Image databases, Content-based retrieval.

1 Introduction

Multimedia databases have become more important as the demand for multimedia information (such as text, audio, image, and video) has increased. Content-based retrieval of multimedia data is currently an active research area. However, it encounters three major difficulties. First, the content is subjective; this calls for a powerful set of search facilities including keywords, sounds, color, texture, spatial information, and motion. Second, a method or processing technique designed and developed for one type of data or feature is usually not appropriate for others.
For instance, a technique designed for indexing audio data may not be usable for image data, and a technique developed for a color feature may not be useful for a texture feature in image and video data. Third, multimedia data is usually very large, and retrieval requires an exhaustive search.

A similarity search is desirable for a multimedia database because exact-match retrieval cannot be applied. For example, if a picture of a house is used as a query to an image database, we expect to retrieve pictures that contain similar houses. The comparison between a query and the records in a database is not pixel by pixel; rather, it measures closeness to the query. Similarity matching requires computing the distance between the query and each record in the database; the best matches are the records with the smallest distances.

To solve these three problems, we use a mathematical model to represent the features, a k-tree model to represent the data structures of the multimedia data, and parallelism to reduce the retrieval time. In this paper, color and texture are the features of interest; they represent the subjective information of the multimedia data. We use a normalization technique to generate the indices. The domain of a feature is reduced to a set of selected values from the universe of potential values for that feature, and each element in the reduced set is given an identification number [7]. When data is inserted into the system, it is converted to the selected domain. The feature is represented by a histogram. For the color feature, a few colors are picked from the whole infinite universe of colors, and each selected color is indexed by a number. The color feature of an image or a video is represented by a histogram over the indexed colors. For the texture feature, we select a set of textures and assign an identification number to each texture.
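As an illustration of the color indexing and distance-based matching described above, the following is a minimal sketch rather than the implementation used in this work. The five-color palette, the nearest-color mapping, the L1 histogram distance, and all function names are assumptions introduced only for this example; the metric actually used is not specified here.

```python
from typing import Dict, List, Sequence, Tuple

Color = Tuple[int, int, int]

# Hypothetical reduced color domain: a palette entry's position is its index number.
PALETTE: List[Color] = [
    (0, 0, 0),        # 0: black
    (255, 255, 255),  # 1: white
    (255, 0, 0),      # 2: red
    (0, 255, 0),      # 3: green
    (0, 0, 255),      # 4: blue
]


def nearest_index(pixel: Color) -> int:
    """Convert a pixel to the selected domain: index of the closest palette color."""
    return min(
        range(len(PALETTE)),
        key=lambda i: sum((p - q) ** 2 for p, q in zip(pixel, PALETTE[i])),
    )


def color_histogram(pixels: Sequence[Color]) -> List[float]:
    """Normalized histogram over palette indices (independent of image size)."""
    counts = [0] * len(PALETTE)
    for px in pixels:
        counts[nearest_index(px)] += 1
    total = len(pixels)
    return [c / total for c in counts] if total else [0.0] * len(PALETTE)


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """L1 distance between two histograms (an assumed metric for this sketch)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def rank_by_similarity(
    query: Sequence[Color], database: Dict[str, Sequence[Color]]
) -> List[Tuple[float, str]]:
    """Exhaustively compare the query with every record; smallest distance first."""
    qh = color_histogram(query)
    return sorted(
        (histogram_distance(qh, color_histogram(pixels)), name)
        for name, pixels in database.items()
    )


if __name__ == "__main__":
    query = [(250, 5, 5)] * 3 + [(10, 10, 10)]
    database = {
        "mostly_red": [(255, 0, 0)] * 3 + [(0, 0, 0)],
        "mostly_blue": [(0, 0, 255)] * 4,
    }
    for distance, name in rank_by_similarity(query, database):
        print(f"{name}: {distance:.2f}")
```

Because every record is compared with the query, the cost grows linearly with the database size, which is the motivation for distributing the records across processors.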
The texture feature is represented by a histogram of texture identification numbers, the same method used for the color feature. The comparison of two features is based on the distance between the histograms that define them.

To reduce the response time, one may use a parallel model on a homogeneous system to perform content-based multimedia retrieval. Experimental results for that model were very positive in both qualitative and quantitative metrics. However, in practice, we do not have dedicated machines that always share the same configuration. The homogeneous model may not be efficient enough in a real-life heterogeneous environment. In this paper, we investigate a data partitioning scheme for multimedia database retrieval on a heterogeneous cluster system. We use system characteristics, such as processor speed, input/output time, and available storage, to partition data among the processors in the system. Our computer system environment is composed of Sun Sparc and

International Conference on Intelligent Systems (ICIS-99), Denver, Colorado, June 24-26, 1999
For other papers of the PDC group at GWU, go to http://www.seas.gwu.edu/seas/eecs/Research/Parallel-Distributed