IPSJ SIG Technical Report

DNN-based GOP and Its Application to Automatic Assessment of Shadowing Speeches

Junwei Yue 1, Fumiya Shiozawa 1, Shohei Toyama 1, Yutaka Yamauchi 2, Kayoko Ito 3, Daisuke Saito 1, Nobuaki Minematsu 1

Abstract: Shadowing is currently one of the most popular research topics in CALL (Computer Assisted Language Learning). Our previous studies realized automatic assessment using GOP (Goodness of Pronunciation) scores and took a step toward automatically generating corrective feedback for shadowing speeches. In this study, we collected English shadowing speeches from Japanese university students. Manual scores for these speeches were given by a bilingual English teacher. Using this labeled corpus, we investigated automatic proficiency assessment using DNN (Deep Neural Network) based acoustic models. GOP scores were estimated using the DNN and compared with GMM-based GOP scores in terms of assessment performance. Further, DTW (Dynamic Time Warping) distances between learners’ shadowed utterances and model utterances were calculated using posterior vectors. This DTW-based score was also compared with the GOP-based scores. The results suggest that the DNN-based approach performs better than the traditional GMM-based one. Language independency is also discussed for the DTW-based comparison.

Keywords: CALL, Shadowing, Corpus, Assessment, GOP, DNN, DTW, Language Independency

1. Introduction

Shadowing is a task that requires the speaker to repeat the played audio immediately while listening to it. It has been adopted as a practice strategy for simultaneous interpreters since it involves not only speaking and listening but also comprehending speech. Recently, many studies have shown that shadowing is also effective for language learning, especially second language learning [1], [2], [3].
All of these studies suggested that shadowing could be more effective, or at least no less effective, in improving speakers’ language skills than traditional practice strategies such as extensive reading, reading aloud and listening. However, learners need corrective feedback on their shadowing speeches. So far, this work has usually been done by language teachers, which requires a large amount of human resources. One solution is to estimate proficiency scores and generate the corresponding feedback automatically. To train and evaluate estimation models, a corpus of shadowing speeches labeled with manual scores is also required.

In our previous studies, we adopted GMM-based GOP (Goodness of Pronunciation) scores as automatic estimates of shadowers’ proficiency [4]. We also took a step toward automatic corrective feedback generation, where shadowing errors in a subset of the corpus were transcribed [5]. There, GOP was adopted as one feature to predict proficiency scores using regression models. The previous results suggested that GMM-based GOP scores correlate well with TOEIC scores when the learners’ language proficiencies are well distributed and the recorded speeches are clean. When speeches are recorded with background noise and many speakers have similar language proficiencies, the correlation drops dramatically. This could be alleviated by introducing other features and performing regression analysis [5].

1 The University of Tokyo
2 Tokyo International University
3 Kyoto University

However, it has long been doubted whether it is reasonable to adopt TOEIC scores as a metric of shadowing proficiency, since TOEIC tests did not contain any speaking test until a few years ago. In addition, the corpus used in our previous study [4] is not large enough, since only about 40 speakers participated in those experiments.
Thus, in this study, we collected English shadowing speeches from 125 university students for a wider examination. A bilingual English teacher manually scored these speeches, taking into account that the utterances were obtained from shadowing practice. With these scores as the ground truth of the learners’ actual shadowing performance, DNN (Deep Neural Network) based and GMM-based GOP scores are computed and compared. In addition, DTW (Dynamic Time Warping) distances between shadowed and model speeches are computed using DNN-based posteriors, and the results are compared with the DNN-based GOP scores. Language independency is also discussed here.

2. Corpus collection

As previously mentioned, we collected English shadowing speeches from university students in Japan. An online shadowing recording site was developed for this data collection. It can be used for both shadowing practice and recording.

© 2017 Information Processing Society of Japan. Vol.2017-SLP-115 No.13, 2017/2/18

125