MultiBIC: an Improved Speaker Segmentation Technique for TV Shows

Paula Lopez-Otero, Laura Docio-Fernandez, Carmen Garcia-Mateo

Department of Signal Theory and Communications, Universidade de Vigo, Spain
{plopez, ldocio, carmen}@gts.tsc.uvigo.es

Abstract

Speaker segmentation systems usually have problems detecting short segments, which raises the number of deletions and therefore harms the performance of the system. This is a complication when segmenting multimedia information such as movies and TV shows, where dialogs among characters are very common. In this paper a modification of the BIC algorithm is presented that remarkably reduces the number of deletions without increasing the number of false alarms. This modification, referred to as MultiBIC, assumes that two change-points are present in a window of data, whereas the conventional BIC approach supposes that there is just one. This lets the system notice when there is more than one change-point in a window, so it finds shorter segments than traditional BIC.

Index Terms: speaker segmentation, TV shows, miss detection reduction

1. Introduction

A recent application where speaker segmentation systems are found to be useful is the processing of multimedia databases. Nowadays, with the widespread use of digital audio and video, the possibilities for processing and manipulating this kind of data are diverse. In many multimedia applications, it is useful to extract information from the data that allows navigating the audio and video contents in order to find something specific. For example, laughter detection [1][2] can help to identify which parts of a movie are funny, speaker detection and tracking [3][4] finds out when a particular actor or character is talking, and combining speaker identification and speaker segmentation allows identifying dialogs [5].
To perform those tasks and similar ones, it is important to have a good segmentation of the data to be processed. There are many publications about speaker segmentation systems [6][7], with the Bayesian Information Criterion (BIC) algorithm being widely employed [8][9][10]. The performance of these systems is good, but it has to be taken into account that not all speaker and audio segmentation systems face the same problems.

In our previous research on speaker segmentation [11][12], we focused on audio data extracted from European Parliament sessions, which were recorded and annotated within the EU project "Technology and Corpora for Speech to Speech Translation" (TC-STAR) [13]. This database is characterized by long audio segments spoken by the same speaker that should not be broken. Existing conventional speaker segmentation systems have a high false alarm rate on such a database. To solve that problem, statistical processes were used in [11][12] to reduce the number of false alarms.

However, when data come from TV shows and movies, where more natural and spontaneous talk is found, the dominant form of speech is dialog, in which it is common to have several speakers speaking in very short turns. The problem in this case is the opposite of the one found in the Parliament sessions: as the segments to be detected are short, the number of miss detections will be very high. Nevertheless, it is important that reducing the number of deletions does not increase the number of false alarms, which would solve one problem but cause another. To achieve these aims, a novel BIC-based approach is proposed, which assumes that a given window of audio data can contain two change-points instead of just one, thereby detecting shorter segments and making this approach more desirable when the audio to be segmented comes from a TV show.

The rest of the paper is organized as follows. Section 2 describes the BIC algorithm.
Section 3 presents the new segmentation approach, referred to as MultiBIC. In Sections 4 and 5 the experimental framework and results are given. Finally, Section 6 summarizes the conclusions and discusses future work in this field.

2. The BIC algorithm

The BIC algorithm is broadly used in speaker segmentation systems. It is a model selection criterion that maximizes the log-likelihood penalized by the complexity of the model [6]. Given a window of data $X = (x_1, \ldots, x_L)$ of length $L$ frames extracted from an audio stream, it is divided into two sub-windows $X_1 = (x_1, \ldots, x_i)$ and $X_2 = (x_{i+1}, \ldots, x_L)$, where $i$ is the frame of $X$ at which the window is split, and a hypothesis test is run in order to select one of the two following possibilities:

• $H_0$: Both sub-windows $X_1$ and $X_2$ are modeled by the same model $M$.
• $H_1$: Sub-windows $X_1$ and $X_2$ are modeled by two different models $M_1$ and $M_2$.

This hypothesis test is run splitting the window $X$ at different frames (different values of $i$). The value of $i$ is varied according to a given resolution $r$, and the frame $i$ that is the best candidate for a change-point is found by evaluating the following expression:

$$\Delta \mathrm{BIC}(i) = L_i - \lambda P \qquad (1)$$

where

$$P = \frac{1}{2} M \log L = \frac{1}{2}\left(d + \frac{1}{2} d(d+1)\right) \log L \qquad (2)$$

is the penalty function as stated by Schwarz [14], $M$ corresponds to the number of free parameters of the Gaussian model, and $\lambda$ is a weight that increases or decreases the influence of the penalty. When $\lambda$ is small, fewer changes will be discarded by the BIC algorithm; the opposite happens when

Copyright © 2010 ISCA. INTERSPEECH 2010, 26-30 September 2010, Makuhari, Chiba, Japan. 10.21437/Interspeech.2010-708