OVERLAPPED SPEECH DETECTION FOR IMPROVED SPEAKER DIARIZATION IN MULTIPARTY MEETINGS

Kofi Boakye 1, Beatriz Trueba-Hornero 1,2, Oriol Vinyals 1, Gerald Friedland 1

1 International Computer Science Institute, Berkeley, CA, U.S.A.
2 Polytechnic University of Catalonia, Barcelona, Spain

ABSTRACT

State-of-the-art speaker diarization systems for meetings are now at a point where overlapped speech contributes significantly to the errors made by the system. However, little if any work has yet been done on detecting overlapped speech. We present our initial work toward developing an overlap detection system for improved meeting diarization. We investigate various features for use in the detector, with a focus on high-precision performance, and examine performance results on a subset of the AMI Meeting Corpus. For the high-quality signal case of a single mixed-headset channel signal, we demonstrate a relative improvement of about 7.4% DER over the baseline diarization system, while for the more challenging case of the single far-field channel signal the relative improvement is 3.6%. We also outline steps toward improvement and moving beyond this initial phase.

Index Terms: speaker diarization, overlap detection

1. INTRODUCTION

The presence of overlapped, or co-channel, speech in meetings is a common occurrence and a natural consequence of the spontaneous multiparty conversations that arise within these meetings. This speech, in addition, presents a significant challenge to automatic systems that process audio data from meetings, such as speech recognition and speaker diarization systems. In the case of speaker diarization, current state-of-the-art systems assign speech segments to only one speaker, thus incurring missed speech errors in regions where more than one speaker is active. For these systems, such as our own ICSI Diarization System [1], this error may represent a significant portion of the diarization error.
For example, in previous RT diarization evaluations, up to 43% relative of the ICSI system diarization error consisted of missed speech errors due to overlap.

To be sure, it is only recently that diarization error rates of systems have been reduced to the point that a large portion of the remaining error is due to overlap. As a result, little work has been done on addressing the issues posed by the phenomenon. Some studies have been reported about the effects of overlap in meetings (e.g., [2], [3], and [4]), but work on systems for identifying overlapped speech and mitigating its effects in speaker diarization appears to be absent from the literature. As overlapped speech is now a major obstacle in improving the performance of speaker diarization systems, efforts in overlap detection will be of increasing interest and importance.

This work was partly supported by the Swiss National Science Foundation through the research network IM2 and the European Union 6th FWP IST Integrated Project AMIDA.

With this view, we present in this paper our initial efforts toward addressing overlapped speech in automatic speaker diarization. This consists of an overlap detection system along with a segment post-processing procedure for the segmentation generated by the speaker diarization system. The overlap detector is an HMM-based segmenter that operates using features tailored for the task, while the post-processing procedure is a speaker assignment method for the identified overlap segments based on speaker posterior probabilities produced by the diarization system.

As with any detection scheme, the overlap system is susceptible to errors of two types: false alarms and misses. These errors impact the diarization system quite differently, with false alarms carrying through to increase the diarization false alarm error and misses having no effect on the baseline diarization error.
Because of this difference, the overlap detector is optimized for low false alarms, which corresponds to a high-precision (and possibly low-recall) operating point.

The remainder of this paper is organized as follows. The diarization system is briefly described in Section 2, and the HMM-based segmenter along with the segmenter features are described in Section 3. The diarization segment post-processing procedure is detailed in Section 4, and we present results on AMI development data in Section 5. Finally, conclusions and future work are given in Section 6.

2. THE ICSI DIARIZATION SYSTEM

The goal of speaker diarization is to segment audio into speaker-homogeneous regions, ultimately to answer the question, “Who spoke when?” In the ICSI diarization system, as with most state-of-the-art systems, this is accomplished through agglomerative clustering of segments with merging