Mispronunciation Diagnosis of L2 English at Articulatory Level Using Articulatory Goodness-Of-Pronunciation Features

Hyuksu Ryu, Minhwa Chung
Department of Linguistics, Seoul National University, Seoul, Republic of Korea
oster01@snu.ac.kr, mchung@snu.ac.kr

Abstract

This paper proposes a method to provide an articulatory diagnosis of English produced by Korean learners using articulatory Goodness-Of-Pronunciation (aGOP) features, which are based on distinctive feature theory in phonology. Previous studies on mispronunciation diagnosis have mainly dealt with pronunciation errors at the phone level: when a segment is realized as a mispronunciation, the learner is told which phone was recognized in its place. However, to provide learners with more effective corrective feedback, diagnosis is better performed at the articulatory level, such as place and manner of articulation, rather than at the phone level. This study aims to provide automatic articulatory diagnosis using articulation-based confidence scores. First, the learners' speech is forced-aligned and recognized to compute the GOP and aGOP scores. When the forced-aligned segment is a consonant, articulatory diagnosis is conducted in three articulatory categories: voicing, place of articulation, and manner of articulation. Otherwise, diagnosis is performed in terms of rounding, height, and backness, corresponding to the articulatory characteristics of vowels. Experimental results show that the F1 scores for voicing, place, and manner of consonants are 0.828, 0.754, and 0.781, respectively, whereas the F1 scores for rounding, height, and backness of vowels are 0.843, 0.782, and 0.824, respectively. These results indicate that the proposed method yields effective articulatory diagnosis.

Index Terms: articulatory Goodness-Of-Pronunciation, mispronunciation diagnosis, CAPT, English produced by Korean learners
1. Introduction

Corrective feedback explaining where learners make pronunciation errors and how to correct them is essential for Computer-Assisted Language Learning (CALL) and Computer-Assisted Pronunciation Training (CAPT) systems [1]. That is, mispronunciation detection and diagnosis using speech technology are necessary for effective CALL/CAPT.

There have been several studies on detecting learners' pronunciation errors [2][3][4][5]. The study of [2] suggested an extended recognition network (ERN), which expands learners' pronunciation dictionaries by predicting frequent erroneous pronunciation sequences. When an erroneous pronunciation sequence is recognized, the learner is considered to have made a pronunciation error. However, the ERN approach has difficulty identifying the mispronunciation patterns that learners frequently show for each L1-L2 pair. It is also difficult to guarantee that an ERN covers most of the possible mispronunciations [6].

Another approach to detecting pronunciation errors uses confidence scores such as Goodness-Of-Pronunciation (GOP) [3][4]. The confidence score-based approach has the virtues of L1/L2 independence and ease of computation [7]. However, it is difficult to provide corrective feedback with this approach, since learners cannot interpret confidence scores alone or use them to improve their pronunciation. Diagnosis of the detected pronunciation errors was not provided in this line of research.

Several previous studies [6][8][9] conducted diagnosis of mispronunciations as well as detection of pronunciation errors. Li et al. [6] suggested a multi-distribution DNN (MD-DNN) that uses acoustic features, graphemes, and the canonical pronunciation as DNN inputs to predict learners' actual pronunciation. When the predicted pronunciation differs from the canonical pronunciation, it is considered a mispronunciation.
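The GOP scores referenced above [3][4] are commonly approximated as a duration-normalized log-likelihood ratio between the forced-aligned canonical phone and the best-scoring phone from free recognition. The sketch below illustrates this standard approximation only; the function names, inputs, and the threshold value are illustrative assumptions, not details from this paper.

```python
def gop_score(ll_forced: float, ll_free_best: float, n_frames: int) -> float:
    """Duration-normalized log-likelihood ratio approximation of GOP.

    ll_forced    : log-likelihood of the canonical (forced-aligned) phone
    ll_free_best : log-likelihood of the best phone from free recognition
    n_frames     : number of frames in the segment
    """
    return (ll_forced - ll_free_best) / n_frames


def is_mispronounced(gop: float, threshold: float = -2.0) -> bool:
    """Flag a segment when its GOP falls below a (typically
    phone-dependent) threshold; -2.0 here is an arbitrary example."""
    return gop < threshold


# A 10-frame segment where free recognition outscores the canonical phone:
print(gop_score(-120.0, -100.0, 10))  # -2.0
```

Because the canonical phone can never outscore the best free-recognition result, GOP values are at most zero, and more negative values indicate a poorer match to the canonical pronunciation.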
Wang and Lee [8] proposed hierarchical multi-layer perceptrons (MLPs). The first MLP is binary and classifies each frame as correct or incorrect. The second MLP then classifies each frame identified as incorrect by the first MLP into one of the Error Patterns (EPs) as a diagnosis. Xie et al. [9] extracted landmark features for nasal codas spoken by learners of Chinese and detected pronunciation errors by applying an SVM.

In these studies, diagnosis is performed in a hierarchical way, as shown in Figure 1. First, the proposed pronunciation error detector distinguishes between mispronunciations and correct pronunciations. In addition to this binary mispronunciation detection, diagnosis is carried out for instances correctly detected as mispronunciations (True Rejection in Figure 1). Diagnosis performance is reported as the diagnosis error rate (DER), defined as the percentage of incorrectly recognized phones among those correctly identified as mispronunciations.

[Figure 1: Hierarchical structure for mispronunciation detection and diagnosis presented in [6][8]. Pronunciation segments are divided into mispronunciations (False Acceptance (FA), True Rejection (TR)) and correct pronunciations (True Acceptance (TA), False Rejection (FR)); True Rejections are further divided into Correct Diagnosis (CD) and Diagnostic Error (DE).]

These hierarchical approaches to diagnosis have the limitation that they provide diagnosis at the phone level only. For example, suppose a learner pronounces the word 'give' /gIv/ as /gIb/. When the CAPT system detects a pronunciation error at the coda position and recognizes the phone as /b/, the system reports a diagnosis of /v/ → /b/. However, for more effective corrective feedback for learners, it is better to provide diagnostic information at the articulatory level, such as voicing, place, and manner of articulation.

7th ISCA Workshop on Speech and Language Technology in Education, 25-26 August 2017, Stockholm, Sweden. 10.21437/SLaTE.2017-12
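The /gIv/ → /gIb/ example shows why an articulatory-level diagnosis is more informative than a phone label: /v/ and /b/ agree in voicing but differ in place and manner. A minimal sketch of refining a phone-level diagnosis into articulatory categories follows; the feature values reflect standard phonetic descriptions, but the table fragment and function are illustrative assumptions, not the paper's aGOP method.

```python
# Illustrative phone-to-feature table (a small fragment, not a full
# inventory); feature values follow standard phonetic classifications.
CONSONANT_FEATURES = {
    "v": {"voicing": "voiced",    "place": "labiodental", "manner": "fricative"},
    "b": {"voicing": "voiced",    "place": "bilabial",    "manner": "stop"},
    "f": {"voicing": "voiceless", "place": "labiodental", "manner": "fricative"},
}


def articulatory_diagnosis(canonical: str, recognized: str) -> dict:
    """Report which articulatory categories (voicing, place, manner)
    differ between the canonical and the recognized phone."""
    c = CONSONANT_FEATURES[canonical]
    r = CONSONANT_FEATURES[recognized]
    return {cat: (c[cat], r[cat]) for cat in c if c[cat] != r[cat]}


# 'give' /gIv/ realized as /gIb/: place and manner differ, voicing does not.
print(articulatory_diagnosis("v", "b"))
# {'place': ('labiodental', 'bilabial'), 'manner': ('fricative', 'stop')}
```

The same lookup reduces a /v/ → /f/ substitution to a single voicing error, which is exactly the kind of targeted feedback that a phone-level diagnosis cannot express.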