Query Rewriting for Multimedia XML Data

Jong P. Yoon, Alaaeldin Hafez, and Vijay Raghavan
The Center for Advanced Computer Studies
University of Louisiana, Lafayette, LA 70504-4330

ABSTRACT

Extensible Markup Language (XML) is emerging as a standard for representing and exchanging data in a variety of applications, each with its own special needs. It is, therefore, natural to explore the use of XML to represent multimedia data. While it is not difficult to customize XML for multimedia data, effective retrieval of information from a multimedia document collection is not straightforward, and a search may often result in either too many or too few hits. To cope with this problem, we propose a framework that enables retrieval from a multimedia document collection to be performed by rewriting user-given queries. We show how user-given queries, specified using typical querying interfaces, and several types of domain-specific rules may be represented in an XML tag-embedded query language. Using a document type definition (DTD) suitable for multimedia data together with domain-specific rules, we provide strategies for relaxing queries, based on input from the user about the size and/or quality of result sets deemed acceptable. Finally, we propose two measures, termed accuracy and coverage, to evaluate the quality of rewritten queries. The contributions of this paper include mechanisms for (1) XML representation of multimedia queries and rules, (2) multimedia query rewriting, and (3) evaluation of rewritten queries.

Keywords: Adaptive query rewriting, Accuracy and Coverage of rewritten queries, Multimedia XML, Approximate information retrieval.

1. INTRODUCTION

Recently, EXtensible Markup Language (XML) has become an emerging standard for representing and exchanging multiple types of data. In XML, the document type definition (DTD) information is stored with the data.
XML allows user-defined elements, nested elements, and an optional validation of document structure with respect to a DTD. Element names are called tags, and elements may also have attributes whose values are always atomic. In this paper, we focus on XML’s application to multimedia data. Multimedia data can be easily represented in XML, and Multimedia XML Data (MXD) is therefore self-describing data. However, multimedia data retrieval is not straightforward, and a query may return either too many or too few hits.

EXAMPLE 1.1: Consider the following MXD example.

<doc> multimedia document
  <title> Blue Sky </title>
  <author> James Smile </author>
  <author> Johns Tommy </author>
  <image> Lab //cs lab
    <image> <!-- no information is given at this level of elements -->
      <image> Keyboard </image> <!-- component elements have data as such -->
      <image> Monitor </image>
      <image> Body </image>
      <image> Mouse </image>
    </image>
  </image>
</doc>

The root element, multimedia document, contains the elements for the title, the authors, and the image of the lab. The “Lab” image element in turn contains a nested image element that carries no data of its own and only holds the subcomponent image elements “Keyboard,” “Monitor,” “Body,” and “Mouse.” We call such an element a “container-only element.” Container-only elements are common in document collections, especially multimedia ones, and they can make it difficult to retrieve information efficiently and effectively. Suppose that a query asks to display the computers with mice that are located in the lab. The multimedia document above does not match the query, because no image element has the value “Computer.” However, if the subcomponent images of a computer are used for query evaluation, the document can be one of the matches. When a document is matched adaptively in this way, the subcomponent images should be retrieved and composed to display the requested image. To match such container-only elements, this paper proposes a method of query rewriting.
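The mismatch in Example 1.1 can be illustrated with a minimal Python sketch. It is not the rewriting method proposed in this paper; it only shows why a direct match on “Computer” fails and how a relaxed match over subcomponents could succeed. The rule table `PART_OF` and the functions `label`, `direct_match`, and `rewritten_match` are hypothetical names introduced for illustration, and the rule that a computer consists of a keyboard, monitor, body, and mouse is an assumption read off the example.

```python
import xml.etree.ElementTree as ET

# The MXD document of Example 1.1 (the "//cs lab" annotation is omitted).
MXD = """
<doc> multimedia document
  <title> Blue Sky </title>
  <author> James Smile </author>
  <author> Johns Tommy </author>
  <image> Lab
    <image>
      <image> Keyboard </image>
      <image> Monitor </image>
      <image> Body </image>
      <image> Mouse </image>
    </image>
  </image>
</doc>
"""

# Hypothetical domain-specific rule (an assumption, not taken from the paper):
# the concept on the left is composed of the parts on the right.
PART_OF = {"Computer": {"Keyboard", "Monitor", "Body", "Mouse"}}


def label(elem):
    """Return the element's own text, ignoring surrounding whitespace."""
    return (elem.text or "").strip()


def direct_match(root, term):
    """Original query: image elements whose own value equals the term."""
    return [e for e in root.iter("image") if label(e) == term]


def rewritten_match(root, term):
    """Relaxed query: container-only image elements whose child images
    cover every part listed for the term in PART_OF."""
    parts = PART_OF.get(term, set())
    hits = []
    for e in root.iter("image"):
        if label(e):
            continue  # element has its own value, so it is not container-only
        child_labels = {label(c) for c in e.findall("image")}
        if parts and parts <= child_labels:
            hits.append(e)
    return hits


root = ET.fromstring(MXD)
direct = direct_match(root, "Computer")      # no image is labeled "Computer"
relaxed = rewritten_match(root, "Computer")  # the unlabeled container matches
```

Here `direct` is empty, while `relaxed` holds the single unlabeled image element, whose child images could then be retrieved and composed for display.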
Query rewriting has been developed for conventional databases. Applying this earlier work to multimedia applications has the following limitations.

Difficulty in dealing with multiple features retained in multimedia data. Multimedia data consists of multiple types of data and is tagged by nested elements. Typical approaches are not suitable for dealing with documents tagged by nested elements, especially container-only elements.