A Metric for Evaluating Speech Recognizer Output Based on Human-Perception Model

Nobuyasu Itoh 1, Gakuto Kurata 1, Ryuki Tachibana 1, Masafumi Nishimura 2
1 IBM Research, Toyosu, 135-8511, Japan
2 Graduate School of Informatics, Shizuoka University
(iton,gakuto,ryuki)@jp.ibm.com, nisimura@inf.shizuoka.ac.jp

Abstract

The word error rate and the character error rate are the metrics usually used to evaluate the accuracy of speech recognition. They are naturally defined, objective metrics and are helpful for comparing recognition methods fairly. However, they do not necessarily reflect the overall performance of a recognition system or the usefulness of its output. To address this problem, we propose a metric that replicates human-annotated scores reflecting the annotators' perception of the recognition results. The features we use are the numbers of insertion, deletion, and substitution errors in the characters and the syllables. In addition, we studied the numbers of consecutive errors, the misrecognized keywords, and the locations of errors. We built models using linear regression and random forests, predicted human-perceived scores, and compared them with the actual scores using Spearman's rank correlation. In our experiments, the correlation of the human-perceived scores with the character error rate was 0.456, while the correlation with the scores predicted by a random forest using 10 features was 0.715. The latter is close to the average correlation between the scores of the human subjects, 0.765, which suggests that we can predict the human-perceived scores from these features and can leverage a human-perception model for evaluating speech recognition performance. The most important features for the prediction are the numbers of substitution errors and consecutive errors.

Index Terms: speech recognition, evaluation, word error rate, character error rate, human perception
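The comparison above is stated in terms of Spearman's rank correlation between predicted and human-assigned scores. As a minimal illustration (our own sketch, not code from the paper), it can be computed as the Pearson correlation of the two rank vectors, with ties receiving averaged ranks:

```python
def ranks(xs):
    """1-based ranks of xs; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of equal values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rank correlation = Pearson correlation over ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Called on a vector of metric values (e.g. per-utterance CER or model predictions) and the corresponding human scores, this yields values of the kind reported above.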
1. Introduction

Evaluation metrics are important for comparing various speech recognition algorithms and assessing which approach outperforms the others. The WER (Word Error Rate) has been the metric most often used in research on speech recognition [1]. The CER (Character Error Rate) is also popular for some languages (such as Japanese) in which the word units can be ambiguous [2]. In some applications, task-oriented metrics have also been tried; for example, Levit proposed a metric for end-to-end accuracy evaluation in voice-enabled search tasks [3]. In addition, speech-enabled dialog systems have usually been evaluated with the concept error rate, the task completion ratio, or the time and average number of turns required for completion [4][5], because the quality of a dialog system is affected not only by the recognition accuracy but also by the interpretation algorithms and the dialog-management strategies. However, in many speech applications, such as dictation, voicemail transcription, and message creation, the quality of the output text (its readability and understandability) is the most important factor in evaluating the speech recognition system.

In this paper we study the human-assessed quality of Automatic Speech Recognition (ASR) output in order to rank transcriptions in Japanese, where the orthographic and word units are hard to define.

This paper is organized as follows. In Section 2 we survey related work. In Section 3 we describe our strategy for collecting perceived-quality scores from human annotators. In Section 4 we discuss the features used to define our metrics. Section 5 presents our experimental results with two frequently used approaches: linear regression and random forests, a well-known non-linear prediction method. In Section 6 we discuss the results, followed by some concluding remarks.

(This work was conducted when the author was a member of IBM Research - Tokyo.)
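Both WER and CER reduce to a normalized Levenshtein edit distance: the minimum number of substitutions, insertions, and deletions needed to turn the reference into the hypothesis, divided by the reference length. A minimal sketch (our own illustration, with function names of our choosing):

```python
def edit_distance_counts(ref, hyp):
    """Align ref and hyp by dynamic programming; return (subs, ins, dels)."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimal edit cost to turn ref[:i] into hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(diag, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrack one optimal alignment to classify the errors.
    i, j = n, m
    subs = ins = dels = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels

def error_rate(ref_tokens, hyp_tokens):
    """WER when tokens are words; CER when tokens are characters."""
    s, i, d = edit_distance_counts(ref_tokens, hyp_tokens)
    return (s + i + d) / len(ref_tokens)
```

Passing word lists gives WER and character lists gives CER, which is why the same machinery serves Japanese, where word segmentation is ambiguous.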
2. Related works

Jones [6] studied the readability of ASR transcripts and reported that certain metadata (capitalization, punctuation, and whether disfluencies were removed) significantly influenced the human-perceived readability. Nanjo [7] proposed the Weighted Keyword Error Rate (WKER) as a metric more suitable for evaluating ASR in applications. The problem of assessing ASR quality is similar in difficulty to assessing (or automatically predicting) human reactions to machine translation output. BLEU was proposed as a more objective metric for evaluating machine translation [8]. However, the relation between metrics for speech recognition output and human-perceived quality has not been sufficiently investigated.

We are investigating which recognition errors affect human perception of ASR output, and how. Our goal is to predict human-perceived quality scores from surface features of the ASR output, and then to leverage the prediction model for evaluating ASR.

Copyright 2015 ISCA. INTERSPEECH 2015, September 6-10, 2015, Dresden, Germany. 1285. 10.21437/Interspeech.2015-321
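The simplest instance of such a prediction model (our own illustration, not the paper's implementation) is an ordinary least-squares fit of the perceived-quality score against a single surface feature, such as the per-utterance character error rate:

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y ≈ a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance / variance of the feature
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b
```

Given (CER, human score) pairs, `a + b * cer` then predicts a score for a new utterance; the multi-feature linear regression and random-forest models discussed later generalize this idea.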