Colour image retrieval based on DCT-domain vector quantisation index histograms

Z.-M. Lu and H. Burkhardt

A new kind of feature for colour image retrieval based on DCT-domain vector quantisation (VQ) index histograms (DCTVQIH) is proposed. For each colour image in the database, 12 histograms (four for each colour component) are calculated from 12 DCT-VQ index sequences, respectively. The retrieval simulation results show that, compared with traditional spatial-domain colour-histogram-based features, the proposed features can substantially improve recall and precision performance.

Introduction: In a content-based image retrieval (CBIR) system [1], instead of being manually annotated with keywords, images are indexed by their own visual content, such as colour [2], texture [3] and shape [4], which are more essential and closer to the human perceptual system than the keywords used in a text-based image retrieval system. Because of the natural classification and clustering characteristics of vector quantisation (VQ) [5], researchers [6–8] have presented several image retrieval schemes based on VQ. Reference [6] uses tree-structured vector quantisation (TSVQ) to organise the feature space as a tree; the search is then made by a branch-and-bound technique on this tree. Reference [7] extracts features directly from the codeword indices of the spatial-domain VQ compressed image. Reference [8] extracts features from the individual codebook generated from the image. Unlike these traditional VQ-based retrieval schemes, this Letter presents a new kind of feature based on DCT-VQ.

DCT-VQ index histograms: As is known, VQ is an efficient clustering and classification technique for high-dimensional spaces. In spatial-domain VQ, a representative codebook is generated offline from a large training vector set using the well-known LBG algorithm [9] before online encoding. During the encoding process, the image is first divided into blocks, each block being an input vector.
For each input vector, we search the codebook for its nearest codeword and then use the codeword index to represent the input vector. During the decoding process, we only need a simple table look-up procedure to obtain the corresponding reproduction vector from the codebook based on the index. To obtain high compression quality, the codebook should be suitable for various images, and thus the codebook size should be large enough. However, the encoding complexity increases with the codebook size. To reduce the complexity, one solution is to perform the VQ compression in a transform domain, such as DCT or DWT. The DCT has an excellent energy compaction property, so we can discard the high-frequency information and perform VQ only on the low-frequency coefficients. It has been shown [10] that DCT-VQ can obtain better performance than spatial-domain VQ.

On the other hand, a histogram can graphically summarise the distribution of a univariate data set and show the centre (i.e. the location) of the data, the spread (i.e. the scale) of the data, the skewness of the data, the presence of outliers and the presence of multiple modes in the data; these features provide strong indications of the proper distributional model for the data. Thus, in this Letter, we utilise normalised histograms, which are invariant to translation and nearly invariant to rotation and scaling. Considering the above two aspects, we present features based on DCT-VQ index histograms, which can be described as follows.

(1) Codebook generation. We first generate a representative codebook for the image database. We randomly select a certain number of images from the database to be the training images. Note that we use the YCbCr colour space to represent each colour image. In the following description, we describe the procedure for only one of the three components. We divide each image into blocks of size 8 × 8. All of these blocks comprise the set $O = \{o_1, o_2, \ldots, o_N\}$, where $N$ is the number of blocks. Then DCT is performed on each block in $O$ to obtain the transformed set $T = \{t_1, t_2, \ldots, t_N\}$. Here, we use a vector to denote each transformed block, i.e. we rearrange the transformed DCT block from a two-dimensional array into a one-dimensional array in the zig-zag sequence. We divide each transformed block $t$ into four parts: the first part is the DC coefficient $d$, the second part is composed of 16 low-frequency coefficients denoted by the vector $l$, the third part is composed of nine middle-frequency coefficients denoted by the vector $m$, and the last part is composed of the remaining high-frequency coefficients denoted by the vector $h$. Because high-frequency coefficients are of relatively small value, they can be discarded during the compression process. However, in this Letter, we handle them differently, by computing the energy $e$ of all the high-frequency coefficients. Thus, we can compose four training sets: the DC set $D = \{d_1, d_2, \ldots, d_N\}$, the low-frequency set $L = \{l_1, l_2, \ldots, l_N\}$, the middle-frequency set $M = \{m_1, m_2, \ldots, m_N\}$ and the high-frequency energy set $E = \{e_1, e_2, \ldots, e_N\}$, where $d_i$ is a one-dimensional vector, $l_i$ is a 16-dimensional vector, $m_i$ is a nine-dimensional vector and $e_i$ is a one-dimensional vector, $1 \le i \le N$. From these four training sets, we generate four corresponding codebooks using the LBG algorithm [9]. For the set $D$, we generate the codebook $C_D$ with $N_1$ codewords. For the set $L$, we generate the codebook $C_L$ with $N_2$ codewords. For the set $M$, we generate the codebook $C_M$ with $N_3$ codewords. For the set $E$, we generate the codebook $C_E$ with $N_4$ codewords. Note that there are three colour components, so we have 12 codebooks, four for each colour component.

Fig. 1 Partition of each DCT block

(2) Feature extraction. After obtaining the above 12 codebooks, we can then encode each image in the database with them.
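The partition of each DCT block described in step (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dct2`, `zigzag_order`, `partition`) are hypothetical, the DCT is the orthonormal DCT-II, the zig-zag order is the JPEG-style scan, and the energy $e$ is taken to be the sum of squared high-frequency coefficients, a definition the Letter does not state explicitly.

```python
import math

N = 8  # block size used in the Letter


def dct2(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of lists."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def zigzag_order(n=N):
    """(row, col) positions of an n x n block in JPEG-style zig-zag order."""
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        # Even anti-diagonals run from bottom-left to top-right,
        # odd ones from top-right to bottom-left.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def partition(block):
    """Split one 8 x 8 block into (d, l, m, e) as in step (1).

    d: DC coefficient, l: 16 low-frequency coefficients,
    m: 9 middle-frequency coefficients,
    e: energy of the remaining 38 high-frequency coefficients
       (assumed here to be the sum of squares).
    """
    coeffs = dct2(block)
    z = [coeffs[r][c] for r, c in zigzag_order()]
    d = z[0]
    l = z[1:17]
    m = z[17:26]
    e = sum(c * c for c in z[26:])
    return d, l, m, e
```

With this orthonormal scaling, a constant block yields a DC coefficient equal to eight times the pixel value, while `l`, `m` and `e` vanish up to rounding error. Collecting the four outputs over all training blocks would give the sets $D$, $L$, $M$ and $E$ that are passed to LBG training.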
For a given colour component of an input image, we divide it into blocks of size 8 × 8 and then perform DCT on each block. We then rearrange and divide each DCT block into four parts as shown in Fig. 1, and encode each part with the corresponding codebook to obtain the corresponding codeword index. For each part, we collect the codeword indices of all DCT blocks to get an index sequence. Thus, we obtain 12 index sequences in total for each image, four for each colour component. Then we calculate the histogram of each index sequence to compose the features for each image. Note that the DC index histogram reflects the coarse information of the image, and the high-frequency energy index histogram reflects its texture information.

Fig. 2 Comparison of precision performance

Experimental results and conclusions: To demonstrate the effectiveness of the proposed features, we compare our DCTVQIH features with traditional spatial-domain colour-histogram-based (SCH) features. We use a standard database [11]; the experiment is carried out on a Pentium IV computer with a 2.80 GHz CPU. This database includes 1000 images of size 384 × 256 or 256 × 384, which are classified into ten classes, each class including 100 images. We first randomly select two images from each class to be the training images, then perform DCT on all 8 × 8 blocks and compose four training sets, and then generate four codebooks for each colour component: $C_D$ with 64 codewords, $C_L$ with 512 codewords, $C_M$ with 256 codewords and $C_E$ with 128 codewords. Based on these codebooks, we encode each image to get 12 index sequences, and then use a 16-bin histogram to represent each index sequence; thus we get a 192-dimensional feature vector (12 × 16) for each colour image in the database. For comparison, we also extract three 64-bin colour histograms (i.e. a 192-dimensional feature vector) from each image based on the YCbCr colour space.
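The feature extraction and the 12 × 16 = 192 dimension count can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, nearest-codeword search is assumed to use squared Euclidean distance, and, because the Letter does not say how indices from codebooks of 64 to 512 codewords are pooled into 16 bins, equal-width index ranges are assumed here.

```python
def nearest_index(vec, codebook):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for i, codeword in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, codeword))
        if dist < best_dist:
            best, best_dist = i, dist
    return best


def index_histogram(indices, codebook_size, bins=16):
    """Normalised histogram of codeword indices.

    Assumption: indices are pooled into `bins` equal-width ranges.
    """
    hist = [0.0] * bins
    for idx in indices:
        hist[idx * bins // codebook_size] += 1.0
    total = len(indices)
    return [h / total for h in hist] if total else hist


def feature_vector(index_sequences, codebook_sizes, bins=16):
    """Concatenate the histograms of all index sequences of one image."""
    feat = []
    for seq, size in zip(index_sequences, codebook_sizes):
        feat.extend(index_histogram(seq, size, bins))
    return feat


# Codebook sizes from the experiment: C_D, C_L, C_M, C_E for each of Y, Cb, Cr.
sizes = [64, 512, 256, 128] * 3
# Toy index sequences standing in for the 12 sequences of one image.
toy_sequences = [[0, size - 1] for size in sizes]
features = feature_vector(toy_sequences, sizes)
```

Here `features` has 12 × 16 = 192 entries, matching the dimension of the three 64-bin colour histograms used for comparison. Normalising each histogram by the number of blocks makes the feature independent of image size.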
To compare the performance more reasonably, we randomly select

ELECTRONICS LETTERS 18th August 2005 Vol. 41 No. 17