LIP ACTIVITY DETECTION FOR TALKING FACES CLASSIFICATION IN TV-CONTENT

Meriem Bendris 1,2, Delphine Charlet 1, Gérard Chollet 2
1 France Télécom R&D - Orange Labs, France
2 CNRS LTCI, TELECOM-ParisTech, France
{meriem.bendris,delphine.charlet}@orange-ftgroup.com, gerard.chollet@telecom-paristech.fr

ABSTRACT

Our objective is to index people in TV content. In this context, because of multi-face shots and non-speaking face shots, it is difficult to determine which face is speaking. There is no guaranteed synchronization between sequences of a person's appearance and sequences of his or her speech. In this work, we aim to separate talking from non-talking faces by detecting lip motion. We propose a method that detects lip motion by measuring the degree of disorder of pixel directions around the lips. Experiments on a TV-show database show that a high correct classification rate can be achieved by the proposed method.

Index Terms— Audiovisual identity indexing, video search, visual speaker detection.

1. INTRODUCTION

With the increase of Internet use, we see a proliferation of multimedia content (Video on Demand, TV website interfaces). While many technologies are available for capturing and storing multimedia content, technologies that facilitate access to and manipulation of multimedia data still need to be developed. One way of browsing this type of data is audio-visual indexing of people, which allows a user to locate the sequences featuring a given person. In our study, we focus particularly on audio-visual indexing of people in popular TV programs. Identifying people in this video context is a difficult problem due to many ambiguities in the audio, in the video and in their association. First, concerning the audio, the speech is spontaneous, shots are very short and people often speak simultaneously. Secondly, concerning the visual information, faces appear with many variations in lighting conditions, position and facial expressions.
Finally, associating audio and visual information in this context introduces many ambiguities. The main one is the asynchrony between the sequences of speech and of face appearance of a person. It is therefore difficult to determine which face is speaking in the case of multi-face shots, or of shots where the speaker's face is not detected (not visible). In this work, the objective is to detect whether each face in a video shot is speaking or not, using visual information, in order to associate the correct face with the speaker. We chose to accomplish this by detecting lip activity. In the literature, several methods study the visual information of speech activity to improve speech recognition systems [1, 2] and audio-visual synchrony [3]. Most of these methods require a high-level representation of the lips. Few authors have focused only on detecting mouth activity in order to localize the speaker. In our context, because of the variability of face appearance, it is very difficult to extract the shape of the lips with great reliability. In [4, 5], the authors use the difference between mouth regions to detect lip activity. Our contribution in this work is a lip activity detector based on the disorder of pixel directions.

This paper is organized as follows: in Section 2, a system for lip activity detection in a TV context is proposed. Section 3 briefly presents the TV-show database used to perform our experiments. Finally, experiments are reported in Section 4.

2. LIP ACTIVITY DETECTION

Our objective is to detect lip activity in order to classify faces as talking/non-talking in a TV context. The first challenge is to identify the information to be extracted to detect lip activity. In the domains of lip reading, synchrony and visual speaking detection, two types of mouth-region representations are used: grey-level information [1, 3] and high-level visual (geometrical) information such as lip width, height, surface and mouth opening [2].
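The intuition behind the disorder criterion announced above is that lip articulation produces scattered per-pixel motion directions over the mouth region, whereas rigid head motion produces aligned ones. This excerpt does not give the exact disorder measure used by the authors, so the circular-variance formulation and the `direction_disorder` helper below are illustrative assumptions; per-pixel motion angles (e.g. from optical flow over the mouth region) are assumed to be already available.

```python
import numpy as np

def direction_disorder(angles):
    """Circular variance of pixel motion directions (radians).

    Returns a value near 0 when directions are aligned (e.g. rigid head
    motion) and near 1 when directions are scattered (e.g. lip
    articulation while talking). Illustrative measure only.
    """
    # Mean resultant length of the unit vectors for each direction.
    R = np.hypot(np.mean(np.cos(angles)), np.mean(np.sin(angles)))
    return 1.0 - R

rng = np.random.default_rng(0)
# Coherent direction field, as produced by a rigid head translation.
coherent = np.full(1000, 0.3)
# Scattered direction field, as produced by mouth articulation.
scattered = rng.uniform(-np.pi, np.pi, 1000)

print(direction_disorder(coherent))   # close to 0
print(direction_disorder(scattered))  # close to 1
```

A talking/non-talking decision could then be obtained by thresholding this disorder score over the mouth region of each tracked face.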
In [1], the authors combine acoustic features with visual features represented by the lip contour and the grey level of the mouth region in order to improve speech recognition performance. In [2], visual speech parameters are represented by the outer and inner lip width, the outer and inner lip height, and the lip surface. In [3], Discrete Cosine Transform (DCT) coefficients of the grey-level lip area are extracted and combined with MFCC coefficients to measure audio-visual synchrony. The high-level features are not appropriate in a TV context because it is very difficult