Segmentation of Telephone Speech Based on Speech and Non-Speech Models

Michael Heck, Christian Mohr, Sebastian Stüker, Markus Müller, Kevin Kilgour, Jonas Gehring, Quoc Bao Nguyen, Van Huy Nguyen, and Alex Waibel

Institute for Anthropomatics, Karlsruhe Institute of Technology, Germany
{heck,christian.mohr,sebastian.stueker,m.mueller,kevin.kilgour,jonas.gehring,quoc.nguyen,van.nguyen,waibel}@kit.edu

Abstract. In this paper we investigate the automatic segmentation of recorded telephone conversations based on models for speech and non-speech in order to find sentence-like chunks for use in speech recognition systems. We present two different approaches, based on Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs), respectively. The proposed methods provide segmentations that allow for speech recognition performance, in terms of word error rate (WER), that is competitive with manual segmentation.

Keywords: support vector machines, segmentation, speech activity detection

1 Introduction

Speech recognition of telephone calls is still one of the most challenging speech recognition tasks to date. Besides the special acoustic conditions that degrade the input features for acoustic modelling, the speaking style in telephone conversations is highly spontaneous and informal. Each channel of a conversation contains large portions without speech activity. Assuming equal participation of both speakers in the conversation, at least 50% of each channel can therefore be omitted from recognition. Omitting non-speech segments on the one hand improves recognition speed and on the other hand can improve recognition accuracy, since insertions due to falsely classified noises in the non-speech segments can be avoided. This is especially promising under the variable background noise conditions of telephone and mobile phone conversations.

We investigate two methods of automatic segmentation that determine sentence-like chunks of speech and filter out non-speech segments prior to speech recognition. As a baseline we regard segmentation based on the output of a regular speech recognizer. Our experimental setups make use of a GMM-based decoder method and an SVM-based method. Evaluation is done in terms of speech recognition performance, since the reference annotations for speech segments are not very accurate.

The evaluation took place on corpora of four distinct languages that were recently released as part of the IARPA Babel Program [1] language collections. babel106b-v0.2f and the subset babel106b-v0.2g-sub-train cover Tagalog and are used in two training data conditions, unlimited and limited, respectively. In the unlimited scenario, a full data set covering approximately 100 hours of transcribed audio material was available for training, whereas for the limited case only