The UVigo-GTM Speaker Diarization System for the Albayzin’10 Evaluation

Laura Docio-Fernandez, Paula Lopez-Otero, Carmen Garcia-Mateo

Department of Signal Theory and Communications, Universidade de Vigo

ldocio@gts.uvigo.es, plopez@gts.uvigo.es, carmen@gts.uvigo.es

Abstract

This paper describes the system submitted by UVigo-GTM to the Albayzin 2010 Speaker Diarization Evaluation. The system is built upon the speaker segmentation system presented in our ICASSP 2010 paper. Specifically, it uses a Poisson-based false alarm reduction strategy: the occurrence of speaker changes is assumed to constitute a Poisson process, so changes are discarded with a probability that follows a Poisson cumulative distribution function. For the speaker clustering step, we use an agglomerative clustering approach in which the speech segments are merged until a stopping point is reached.

Index Terms: speaker segmentation, speaker clustering, CLUTO

1. Introduction

Nowadays, an emerging application area for speech technologies is the structuring of the information in multimedia (audio-visual) documents. These documents are, in general, multi-speaker audio recordings, and for some applications it is relevant to determine “who spoke when”. This task is referred to in the literature as “speaker segmentation and clustering” or “speaker diarization”. Segmenting the data in terms of speakers helps users navigate efficiently through audio documents such as meeting recordings or broadcast news archives: using these segmentation cues, an interested user can directly access a particular segment of speech spoken by a particular speaker. Other applications of speaker segmentation include speaker adaptation in speech recognition and speaker identification, verification and tracking.

The Albayzin 2010 Speaker Diarization Evaluation task focuses on audio broadcast news programs.
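To make the “who spoke when” output concrete, a diarization result can be thought of as a list of time-stamped segments, each carrying an anonymous speaker label. The following minimal Python sketch is illustrative only: the `Segment` type, the labels `spk1` and `spk2`, the time values, and the helper functions are hypothetical names and data chosen for this example, not part of the UVigo-GTM system or of the evaluation's file formats.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One homogeneous stretch of audio attributed to a single speaker."""
    start: float   # segment start time, in seconds
    end: float     # segment end time, in seconds
    speaker: str   # anonymous cluster label (hypothetical naming scheme)


def segments_for_speaker(segments: List[Segment], speaker: str) -> List[Segment]:
    """Return the segments attributed to the given speaker label."""
    return [s for s in segments if s.speaker == speaker]


def speaking_time(segments: List[Segment], speaker: str) -> float:
    """Total duration, in seconds, attributed to the given speaker label."""
    return sum(s.end - s.start for s in segments_for_speaker(segments, speaker))


# Made-up diarization output for a short recording (illustrative data only).
diarization = [
    Segment(0.0, 12.5, "spk1"),
    Segment(12.5, 30.0, "spk2"),
    Segment(30.0, 41.0, "spk1"),
]
```

With such a representation, jumping directly to the segments of one speaker amounts to a simple filter, for instance `segments_for_speaker(diarization, "spk1")`.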
The UVigo-GTM speaker diarization system follows a two-stage approach: a speaker segmentation stage, which detects speaker change points; and a speaker clustering stage, where the speech segments, each spoken by one speaker, are clustered using an agglomerative hierarchical strategy.

In [1], an online four-step speaker segmentation system is introduced: it first performs a coarse segmentation of the data, then refines or discards the change points, discriminates between speech and non-speech, and merges segments that are likely to be spoken by the same speaker. This baseline segmentation system was found to have a high false alarm rate and to tend to estimate short segments. In [2], two novel approaches for reducing the number of false alarms, and thus avoiding erroneous speaker changes, were introduced, evaluated and compared with the false-alarm discard algorithm proposed in [1]. The first approach rejects, with a given discard probability, those changes that are suspected of being false alarms because of their low ΔBIC value. The second strategy assumes that the occurrence of changes constitutes a Poisson process, so changes are discarded with a probability that follows a Poisson cumulative distribution function. The goal of these techniques is to confirm true speaker changes and suppress erroneous ones. The UVigo-GTM speaker diarization system submitted to the Albayzin 2010 Evaluation is based on the second strategy for rejecting change points.

To accomplish the clustering task, an agglomerative hierarchical clustering method was chosen. The clustering algorithm measures the similarity between clusters based on the similarity between pairs of speech segments. The critical elements of this clustering technique are the distance or similarity metric used to compare the speech segments and the selection of the final number of clusters.

This paper is organized as follows.
Section 2 gives a brief description of the baseline speaker segmentation system. The proposed approaches to reduce the false alarm rate are presented in Section 3. Section 4 explains the experimental framework. The performance of the speaker segmentation system using each of the false alarm reduction strategies is shown and discussed in Section 5. Finally, Section 6 concludes the paper and gives some ideas for future work.

2. The speaker segmentation stage

The architecture of the baseline speaker segmentation system described in [1] is depicted in Fig. 1. It has four stages. First, a coarse segmentation is made with the Distance Changing Trend Segmentation (DCTS) algorithm [3] in order to detect audio change-point candidates. Second, these candidates are refined or rejected by the Bayesian Information Criterion (BIC) algorithm [4]. Third, the system performs a Maximum a Posteriori (MAP) adaptation of three different Gaussian Mixture Models (GMMs) to decide whether the audio segment delimited by the new change point and the preceding one is speech, music or silence/noise; if the segment is speech, the same procedure is employed to classify it as male or female speech. Finally, when the two latest segments are speech, an approach based on the Cross Likelihood Ratio (CLR) test [5] is applied to determine whether both speech segments are spoken by the same speaker; in that case, both segments are merged.

2.1. Poisson distribution-based false alarm rejection strategy

FALA 2010 VI Jornadas en Tecnología del Habla and II Iberian SLTech Workshop

The proposed strategy to discard false alarms is based on monitoring the ΔBIC value

ΔBIC(i) = L(i) - λP    (1)

where P is the penalty, corresponding to the number of free parameters of the Gaussian model, and λ is a weight that increases or decreases the influence of the penalty. When λ is a small