A Comparison of GMM-HMM and DNN-HMM Based Pronunciation Verification Techniques for Use in the Assessment of Childhood Apraxia of Speech

Mostafa Shahin 1, Beena Ahmed 1, Jacqueline McKechnie 2, Kirrie Ballard 2, Ricardo Gutierrez-Osuna 3

1 Dept. of Electrical and Computer Engineering, Texas A&M University, Doha, Qatar
2 Faculty of Health Sciences, The University of Sydney, Sydney, Australia
3 Dept. of Computer Science and Engineering, Texas A&M University, College Station, Texas

Abstract

This paper introduces a pronunciation verification method for use in an automatic therapy assessment tool for disordered speech in children. The proposed method creates a phone-based search lattice flexible enough to cover all probable mispronunciations. This allows us to verify the correctness of the pronunciation and to detect the incorrect phonemes produced by the child. We compare two different acoustic models, the conventional GMM-HMM and the hybrid DNN-HMM. Results show that the hybrid DNN-HMM outperforms the conventional GMM-HMM in all experiments, on both normal and disordered speech. The total correctness accuracy of the system at the phoneme level is above 85% on disordered speech.

Index Terms: pronunciation verification, speech therapy, automatic speech recognition, computer aided pronunciation learning, deep learning

1. Introduction

Language production and speech articulation can be delayed in children due to developmental disabilities and neuromotor disorders such as childhood apraxia of speech (CAS) [1]. Traditional CAS therapy requires a child to undergo extended therapy sessions with a trained speech-language pathologist (SLP) in a clinic, which can be both logistically and financially prohibitive. Interactive, automatic speech monitoring tools that children can use remotely in their own homes offer a practical, adaptive and cost-effective complement to face-to-face intervention sessions with an SLP.
A number of technology-based tools have been developed to facilitate general speech therapy, but very few target the specific articulation problems of children with CAS [2], [3], [4]. The intuitive and engaging environment provided by tablets and smartphones has led to the development of generic speech therapy applications for mobile devices [5], [6]. The main drawback of all these systems is the absence of automatic feedback, which makes it hard to adapt the therapy regimen to the specific needs of each child.

There has been limited success in incorporating automatic speech recognition (ASR) systems into speech therapy tools, as ASR systems still exhibit high error rates for developing children owing to variations in vocal tract length, formant frequencies, pronunciation and grammar. Perceptual evaluations of apraxic speakers can also be inconsistent and prone to error [7]. The Speech Training, Assessment, and Remediation system (STAR) [8] evaluates phoneme production by computing the likelihood ratio obtained when aligning the subject's speech with the target phoneme versus alternative phonemes. In Vocaliza [9], a set of confidence measures is used to score pronunciation at the phoneme level. Both systems decide whether a phoneme was pronounced correctly or incorrectly without actually detecting the errors made by the child. ASR has also been used widely in the area of second language learning. For example, Kim et al. [10] defined a set of rules describing the mispronunciations expected from native Korean speakers pronouncing English words, and used them to detect pronunciation errors. In Hafss [11], a search lattice built from all probable pronunciation variants was fed to a speech decoder to identify errors in Quranic Arabic. In our previous work [12], we proposed an automated therapy tool for children with CAS.
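The likelihood-ratio test used by STAR-style systems can be illustrated with a minimal sketch: the target phone is accepted when its acoustic log-likelihood beats the best competing phone by some margin. The function name, the scores, and the zero threshold below are illustrative assumptions, not values from [8].

```python
def verify_phone(target_loglik, competitor_logliks, threshold=0.0):
    """Accept the target phone if the log-likelihood ratio between the
    target model and the best competing phone model exceeds a threshold."""
    best_competitor = max(competitor_logliks.values())
    llr = target_loglik - best_competitor
    return llr, llr >= threshold

# Hypothetical example: the target /r/ scores -4.2, but the best
# alternative (/w/) scores -3.5, so the ratio is negative and the
# phone is flagged as mispronounced.
llr, ok = verify_phone(-4.2, {"w": -3.5, "l": -6.0})
```

Raising the threshold makes such a verifier stricter, trading more false rejections of correct productions for fewer missed errors.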
The proposed system consists of: 1) a clinician interface where the SLP can create and assign exercises to different children and monitor each child's progress; 2) a tablet-based mobile application that prompts the child with the assigned exercises and records their speech; and 3) a speech processing module installed on a server that receives the recorded speech, analyzes it and returns the assessment results to the SLP as feedback. The SLP can then update the exercises assigned to each child according to the feedback received. The speech processing module consists of multiple components that specialize in identifying the types of errors made by children with CAS. In [13] we presented a lexical stress classifier to detect prosodic errors. In this paper, we enhance our earlier pronunciation verification method [12] by creating a search lattice that contains all the expected mispronunciation phonemes and includes a garbage model to collect any unexpected inserted phonemes. We also apply a penalty value to both the alternative and garbage paths to control the strictness of the system. We compare the performance of two different acoustic models, the conventional GMM-HMM and the hybrid DNN-HMM [14], which has been reported to outperform the conventional GMM-HMM in other applications [15], [16], particularly with smaller training datasets [17]. The proposed method allows us to verify the correctness of phoneme pronunciation with higher accuracy than previous pronunciation verification systems [8], [9] and provides a mechanism to detect the type of error made (insertion, deletion or substitution), if any.

The remainder of this paper is structured as follows. Section 2 describes the method and the speech corpus used. Section 3 presents the experiments performed and the results. Finally, the conclusions are summarized in Section 4.

A Comparison of GMM-HMM and DNN-HMM Based Pronunciation Verification Techniques for Use in the Assessment of Childhood Apraxia of Speech. INTERSPEECH 2014, 14-18 September 2014, Singapore. Copyright 2014 ISCA.

2. Methods

2.1. System description
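The lattice construction described in the introduction can be sketched in miniature: for each target phone the lattice offers the canonical phone, a set of expected substitutions and a deletion arc (both penalized), plus a penalized garbage self-loop that absorbs unexpected insertions. The substitution table, penalty values, and arc labels below are hypothetical illustrations, not the paper's actual models or parameters.

```python
ALT_PENALTY = -2.0      # discourages taking a substitution or deletion path
GARBAGE_PENALTY = -4.0  # discourages the insertion (garbage) path

# Hypothetical table of expected substitutions for a few phones.
EXPECTED_SUBS = {"r": ["w", "l"], "s": ["th"], "k": ["t"]}

def build_lattice(target_phones):
    """Return lattice arcs as (from_state, to_state, label, score) tuples,
    with one state per position in the target phone sequence."""
    arcs = []
    for i, phone in enumerate(target_phones):
        src, dst = i, i + 1
        arcs.append((src, dst, phone, 0.0))                # canonical phone
        for alt in EXPECTED_SUBS.get(phone, []):
            arcs.append((src, dst, alt, ALT_PENALTY))      # substitution
        arcs.append((src, dst, "<del>", ALT_PENALTY))      # deletion
        arcs.append((src, src, "<gbg>", GARBAGE_PENALTY))  # insertion loop
    return arcs

arcs = build_lattice(["r", "ae", "t"])  # target word "rat"
```

Decoding the child's utterance against such a lattice yields the best-scoring path; any non-canonical arc on that path directly identifies the error and its type (substitution, deletion or insertion), while the penalty magnitudes control how strict the verifier is.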