Speaker Diarization Using Direction of Arrival Estimate and Acoustic Feature Information: The I²R-NTU Submission for the NIST RT 2007 Evaluation

Eugene Chin Wei Koh¹,², Hanwu Sun², Tin Lay Nwe², Trung Hieu Nguyen¹, Bin Ma², Eng-Siong Chng¹, Haizhou Li², Susanto Rahardja²

¹ School of Computer Engineering, Nanyang Technological University (NTU), Singapore 639798
{kohc0026, nguy0059, aseschng}@ntu.edu.sg
² Human Language Technology Department, Institute for Infocomm Research (I²R), Singapore 119613
{hwsun, tlnma, mabin, hli, rsusanto}@i2r.a-star.edu.sg

Abstract. This paper describes the I²R/NTU system submitted for the NIST Rich Transcription 2007 (RT-07) Meeting Recognition evaluation Multiple Distant Microphone (MDM) task. In our system, speaker turn detection and clustering are performed using Direction of Arrival (DOA) information. Purification of the resultant speaker clusters is then done by performing GMM modeling on acoustic features. As a final step, non-speech and silence removal is performed. Our system achieved a competitive overall diarization error rate (DER) of 15.32% for the NIST Rich Transcription 2007 evaluation task.

1 Introduction

Speaker diarization has often been described as the task of identifying “Who Spoke When”. In the context of the NIST Rich Transcription 2007 (RT-07) Meeting Recognition evaluations [1], this involves indicating the start and end time of every speaker segment present in the continuous audio recording of a meeting. Segments with common speakers have to be identified and annotated with a single speaker identity. This paper describes our system for the RT-07 speaker diarization task for multiple distant microphone (MDM) recordings.

Speaker diarization has traditionally relied on acoustic features such as Mel Frequency Cepstral Coefficients (MFCC) [2] or Perceptual Linear Prediction (PLP) [3] to perform segmentation and clustering. Segmentation is commonly done by employing the Bayesian Information Criterion (BIC) [3, 4].
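For reference, the BIC-based change test mentioned above is usually written in the following form. This is the commonly used textbook formulation for full-covariance Gaussian models, given here as an assumption about what such systems compute; it is not taken from this paper, and the symbols $N$, $N_1$, $N_2$, $\Sigma$, $\Sigma_1$, $\Sigma_2$, $d$ and $\lambda$ are introduced only for illustration.

```latex
% Candidate change point splits a window of N feature vectors (dimension d)
% into two parts with N_1 and N_2 vectors, where N = N_1 + N_2.
% \Sigma, \Sigma_1, \Sigma_2 are the sample covariances of the whole window
% and of the two parts; \lambda is a tunable penalty weight.
\Delta\mathrm{BIC}
  = \frac{N}{2}\log\lvert\Sigma\rvert
  - \frac{N_1}{2}\log\lvert\Sigma_1\rvert
  - \frac{N_2}{2}\log\lvert\Sigma_2\rvert
  - \lambda\,\frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N
```

Under this formulation, a speaker change is hypothesized at the candidate point when $\Delta\mathrm{BIC} > 0$. Lowering $\lambda$ makes the test accept more change points, which is one way a system ends up over-segmenting.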
Over-segmentation typically has to be carried out in order to capture most of the speaker turns. This however poses a problem for subsequent clustering as the resulting segments will usually be of short duration and hence do not offer reliable clustering. Our system mitigates this problem by directly using Direction of Arrival (DOA) [5] information to identify speaker transitions and perform clustering. Cluster purification using acoustic features is then performed. Our results from the RT-07 evaluation have shown that purification using acoustic features helps