Efficient Content-Based Retrieval of Humans from Video Databases

Nikolaos Doulamis, Anastasios Doulamis and Stefanos Kollias
National Technical University of Athens
Department of Electrical & Computer Engineering
9, Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
Email: ndoulam@image.ntua.gr

Abstract

An efficient algorithm for retrieving humans from large video databases is presented in this paper. Such retrieval is useful for a variety of applications, including video surveillance for security purposes and speaker identification systems. A human face and body detector is first proposed, based on a simple probabilistic model, to approximately estimate human face and body regions. The adopted approach significantly reduces the required computational cost and simultaneously exploits information existing in MPEG-coded video data. A segmentation fusion scheme is then applied to improve segmentation accuracy. Based on the created segmentation map, a graph is then constructed, which represents the spatial relationship of the extracted segments. Color, texture, motion and shape characteristics are included as additional features of the nodes of the graph. To enhance the flexibility of the proposed system, each node is further decomposed into other graphs (sub-graphs), resulting in a pyramidal graph representation of the visual content.

1. Introduction

In recent years, an increasing amount of visual information has been produced, disseminated, stored and accessed [1], [2]. This growth has been stimulated by rapid progress in capturing, displaying and encoding systems. However, efficient tools and algorithms for searching, retrieving or even organizing visual content are still limited. For this reason, the International Organization for Standardization, through the Moving Picture Experts Group (MPEG), has started a new phase, called MPEG-7, in order to propose an integrated framework for a visual content description interface [3].

Currently, text is used to manage and organize multimedia databases. In particular, images or video sequences are manually indexed using appropriate keywords. However, such an approach presents a number of limitations, especially for new multimedia applications, since the rich visual content cannot be accurately described. Moreover, the inconsistency of keywords among different indexers, the different interpretations of visual content and the large amount of manual effort required make the text-based approach less reliable [4]. Consequently, an alternative mechanism has recently been proposed in the literature, which performs content-based image/video retrieval by exploiting visual information, such as color, motion, texture or shape characteristics.

Several works have recently been proposed in the literature for content-based video indexing. In [5], a hierarchical color segmentation technique has been presented, while a shape description algorithm is analyzed in [6]. A hidden Markov model for image retrieval has been discussed in [7], and the extraction of detailed images has been described in [8]. Object modeling and segmentation for indexing in video databases has been reported in [9]. Furthermore, several prototype systems have been presented and are now in the first stage of their commercial exploitation, such as QBIC [10], Virage [11] or VisualSEEK [12].

The majority of the previously described techniques are designed for general images or video sequences and are not restricted to specific applications. As a result, low-level features are exploited to characterize the visual content. This is due to the fact that the extraction of high-level (semantic) properties for any kind of image or video sequence is, in general, a very arduous task [13].
For this reason, semi-automatic algorithms have recently been proposed in the literature for video object extraction, using the user's assistance to provide an initial approximation of the final segmentation [14].

In our approach, however, we are interested in retrieving humans from video databases. Thus, the description of visual content can be performed more efficiently by exploiting specific human characteristics. In particular, a human face and body detector, based on a probabilistic model, is first used to approximately localize humans in video sequences. Then, the