An Empirical Analysis of Word Error Rate and Keyword Error Rate

Youngja Park 1, Siddharth Patwardhan 2, Karthik Visweswariah 3, Stephen C. Gates 1

1 IBM T.J. Watson Research Center, 19 Skyline Dr., Hawthorne, NY, USA
2 School of Computing, University of Utah, Salt Lake City, UT, USA
3 IBM India Research Lab, New Delhi, India
{young_park,scgates}@us.ibm.com, sidd@cs.utah.edu, v-karthik@in.ibm.com

Abstract

This paper studies the relationship between word error rate (WER) and keyword error rate (KER) in speech transcripts and their effect on the performance of speech analytics applications. Automatic speech recognition (ASR) systems are increasingly used as input for speech analytics, which raises the question of whether WER or KER is the more suitable performance metric for calibrating the ASR system. ASR systems are typically evaluated in terms of WER. Many speech analytics applications, however, rely on identifying keywords in the transcripts; thus their performance can be expected to be more sensitive to keyword errors than to regular word errors.

To study this question, we conduct a case study using an experimental data set comprising 100 calls to a contact center. We first automatically extract domain-specific words from the manual transcription and use this set of words to calculate keyword error rates in the following experiments. We then generate call transcripts with the IBM Attila speech recognition system, using different training for each repetition to generate transcripts with a range of word error rates. The transcripts are then processed with two speech analytics applications, call section segmentation and topic categorization. The results show similar WER and KER in high-accuracy transcripts, but KER increases more rapidly than WER as the accuracy of the transcription deteriorates. Neither speech analytics application showed significant sensitivity to the increase in KER for low-accuracy transcripts.
Thus this case study did not identify a significant difference between using WER and KER.

Index Terms: speech analytics, keyword error rate, word error rate

1. Introduction

Continuing advances in automatic speech recognition (ASR) have enabled many useful speech analytics applications such as topic identification and information retrieval of speech texts. The performance of speech recognition is typically measured using word error rate (WER), the ratio of word insertion, substitution, and deletion errors in a transcript to the total number of spoken words. While it is widely believed that the word error rate of a speech recognition system strongly influences the performance of speech analytics systems [1, 2], some researchers have noticed a divergence between word error rate and the accuracy of speech understanding systems [3, 4, 5, 6]. In certain cases, the accuracy of speech understanding applications improved even though the WER of the speech transcripts increased [4, 6]. This has led researchers to suggest that ASR systems for speech analytics should be trained to optimize keyword error rate (KER) instead of WER, since many speech analytics applications depend mostly on certain keywords. Sandness and Hetherington [5] propose a keyword-based method to train the acoustic model, and Nanjo and Kawahara [3] present a decoding strategy to minimize a weighted keyword error rate.

The use of keyword error rate also raises the challenge of how to obtain the list of domain-specific keywords. Previous work has experimented with relatively constrained domains such as air travel in ATIS (Air Travel Information System) [7] or weather in the JUPITER system [8], where the keywords included only a very limited set of place names and expressions for date and time [5, 6]. Similarly, Nanjo and Kawahara [3] simply used all nouns, except pronouns and numbers, as keywords.
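To make the error rates discussed above concrete, the following sketch computes WER as word-level edit distance, together with one plausible way to restrict the computation to a keyword list. This is an illustration only, not the scoring implementation used in the paper; the `ker` variant (filtering both sequences to keywords before aligning) is an assumption, since the paper's exact KER procedure is described later.

```python
def wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed by dynamic-programming edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

def ker(reference, hypothesis, keywords):
    """One plausible KER definition (hypothetical): keep only keywords
    from both sequences, then apply the same edit-distance scoring."""
    ref_kw = " ".join(w for w in reference.split() if w in keywords)
    hyp_kw = " ".join(w for w in hypothesis.split() if w in keywords)
    return wer(ref_kw, hyp_kw)
```

For example, a single non-keyword substitution in a five-word utterance yields a WER of 0.2 but leaves KER at 0, whereas misrecognizing a keyword moves KER much more than WER, which is the asymmetry the paper studies.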
Therefore, it remains an open question how to best measure performance when ASR is used in combination with speech analytics, and whether WER or KER is the more suitable performance metric for calibrating the ASR system.

In this work, we address these questions by investigating the relationship between WER and KER and their effect on the performance of speech analytics applications in the more unconstrained domain of IT help desk contact center calls. The calls concern many different topics, including issues with network connectivity, user passwords and various software tools. We first present an automatic method to extract domain-specific words from a given transcript collection, which can serve as keywords. We then use the IBM Research Attila speech recognition toolkit to transcribe the calls [9], producing several transcripts with different levels of accuracy by retraining the language model, and study the effect of WER and KER on two speech analytics applications: a topic categorization system that is highly keyword dependent, and a call section segmentation system that does not rely on keywords.

Our experimental results show that KER increases more rapidly than WER as the accuracy of speech recognition decreases: going from a transcript with 18.24% WER to one with 41.64% WER (a 128% increase), KER increases by 200%. WER and KER are very similar for higher-accuracy transcripts. The experiments did not reveal a significant effect of higher KER on the speech analytics applications.

2. Keyword Extraction and Keyword Error Rate Measurement

It is very difficult to define and collect keywords for an unconstrained domain such as contact center calls. In this work, we define keywords as domain-specific nouns and verbs. If a word occurs relatively more frequently in domain-specific text than in non-domain text, the word is regarded as domain-specific.
Based on this notion, we define the domain specificity of a word as the relative probability of occurrence of the word in a domain

Proceedings of the International Conference on Spoken Language Processing, Brisbane, Australia, September 2008
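The relative-frequency notion of domain specificity described above can be sketched as follows. This is a minimal illustration under the assumption that domain specificity is the ratio of a word's relative frequency in the domain corpus to its relative frequency in a background corpus, with add-one smoothing on the background counts; the paper's exact formula may differ.

```python
from collections import Counter

def domain_specificity(word, domain_tokens, background_tokens):
    """Ratio of the word's relative frequency in domain text to its
    relative frequency in non-domain text (hypothetical formulation).
    Scores well above 1 suggest a domain-specific word."""
    dom = Counter(domain_tokens)
    bg = Counter(background_tokens)
    p_domain = dom[word] / len(domain_tokens)
    # Add-one smoothing so unseen background words do not divide by zero.
    p_background = (bg[word] + 1) / (len(background_tokens) + len(bg))
    return p_domain / p_background
```

Under this formulation, a contact-center term such as "password" scores far higher against general text than a function word such as "the", which is the behavior a keyword extractor needs.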