Convolutional Neural Network to Model Articulation Impairments in Patients with Parkinson's Disease

J. C. Vásquez-Correa 1,2, J. R. Orozco-Arroyave 1,2, E. Nöth 2

1 Faculty of Engineering, University of Antioquia UdeA, Calle 70 No. 52-21, Medellín, Colombia.
2 Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.

jcamilo.vasquez@udea.edu.co

Abstract

Speech impairments are one of the earliest manifestations in patients with Parkinson's disease. In particular, articulation deficits related to the capability of the speaker to start/stop the vibration of the vocal folds have been observed in the patients. Those difficulties can be assessed by modeling the transitions between voiced and unvoiced segments of speech. This study proposes a robust strategy to model the articulatory deficits related to starting or stopping the vibration of the vocal folds. The transitions between voiced and unvoiced segments are modeled by a convolutional neural network that extracts suitable information from two time-frequency representations: the short-time Fourier transform and the continuous wavelet transform. The proposed approach improves the results previously reported in the literature. Accuracies of up to 89% are obtained for the classification of Parkinson's patients vs. healthy speakers. This study is a step towards the robust modeling of speech impairments in patients with neurodegenerative disorders.

Index Terms: Parkinson's disease, articulation, convolutional neural network, time-frequency representations, wavelet transform.

1. Introduction

Parkinson's disease (PD) is a neurological disorder that alters the function of the basal ganglia in the midbrain, producing motor and non-motor deficits in the patients [1]. Speech impairments are an early and prominent manifestation that can contribute to the early diagnosis of PD [2].
The main symptoms of the impaired speech of PD patients include reduced loudness, monopitch, monoloudness, hypotonicity, breathy and hoarse voice quality, and imprecise articulation. These symptoms are typically grouped under the term hypokinetic dysarthria [3].

Several studies in the literature have described the speech impairments of PD patients in terms of phonation, articulation, and prosody [4, 5, 6]. Phonation is related to the capability of the speaker to make the vocal folds vibrate to produce voiced sounds; articulation is related to the modification of the position, stress, and shape of several muscles to produce speech; and prosody reflects the variation of loudness, pitch, and timing that makes speech sound natural. Articulation deficits in PD patients are mainly related to reduced amplitude and velocity of lip, tongue, and jaw movements [7]. In particular, imprecise consonant articulation was perceptually found to be one of the most deviant speech dimensions in PD [8].

In general, articulation impairments of PD patients have been analyzed in several studies from both the medical and the engineering perspective. In [5] the authors evaluated possible correlations between vowel articulation, global motor performance, and the stage of the disease. A total of 68 patients and 32 healthy control (HC) speakers were considered. According to the results obtained in several statistical tests, the authors concluded that the vowel articulation index (VAI) is significantly reduced in PD speakers. In [9] six different articulatory deficits in PD were modeled: vowel quality, coordination of laryngeal and supra-laryngeal activity, precision of consonant articulation, tongue movement, occlusion weakening, and speech timing. The authors studied the rapid repetition of the syllables /pa-ta-ka/ pronounced by 24 Czech native speakers, and reported an accuracy of 88% when discriminating between PD patients and HC speakers.
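The VAI mentioned above can be computed from the first two formant frequencies (F1, F2) of the corner vowels /a/, /i/, and /u/. The sketch below uses the formulation commonly reported in the literature, VAI = (F2_i + F1_a) / (F1_i + F1_u + F2_u + F2_a); the exact definition used in [5] should be checked against that paper, and the formant values in the usage example are illustrative, not measured data.

```python
def vowel_articulation_index(f1_a, f2_i, f1_i, f1_u, f2_u, f2_a):
    """Vowel articulation index (VAI) from corner-vowel formants (Hz).

    Values near 1 or below suggest a centralized (reduced) vowel space,
    as reported for PD speakers; larger values indicate more distinct
    corner vowels.
    """
    return (f2_i + f1_a) / (f1_i + f1_u + f2_u + f2_a)

# Illustrative formant values in Hz (hypothetical, for demonstration only):
vai = vowel_articulation_index(f1_a=800, f2_i=2200, f1_i=300,
                               f1_u=350, f2_u=900, f2_a=1300)
```

A centralized vowel space shrinks the numerator (lower F2 of /i/, lower F1 of /a/) and grows the denominator, so reduced articulation drives the VAI down.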
Articulation impairments have also been analyzed using time-frequency representations (TFR) [10], where three TFRs were computed from continuous speech utterances with the aim of detecting changes in the low-frequency components of the spectrum that could be associated with the presence of tremor in the speech. The TFRs include modulation spectra, the wavelet packet transform, and the Wigner-Ville distribution. The authors extracted features related to the energy content and spectral centroids in different frequency bands, and reported an accuracy of up to 77% when classifying PD patients and HC speakers using several classification strategies.

In [11] a method was introduced to model the difficulties observed in PD patients to start/stop the vibration of the vocal folds. The method consists of detecting the transitions from voiced to unvoiced (v-uv), i.e., offsets, and from unvoiced to voiced (uv-v), i.e., onsets, in the speech recording. Then the energy content in frequency bands separated according to the Bark scale is computed.

In order to improve the method presented in [11], in the present study the onsets and offsets are modeled with a more robust strategy that considers both the temporal and frequency domains of the transitions. The onsets and offsets are modeled using two TFRs: the short-time Fourier transform (STFT) and the continuous wavelet transform (CWT). The TFRs are used to feed a convolutional neural network (CNN) that learns high-level representations from the low-level raw features of the TFRs. The combination of TFRs and CNNs has been previously used in speech recognition and other speech processing tasks [12, 13, 14].

The proposed model is tested on the classification of PD patients vs. HC subjects in three different languages: Spanish, German, and Czech. The results are compared to a baseline computed with the strategy introduced in [11]. According to the results, the proposed approach improves upon previous studies.
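The pipeline just described, detecting v-uv/uv-v transitions and computing a TFR around each one, can be sketched as follows. This is a simplified illustration: the voicing decision here is a plain short-time energy threshold, whereas [11] and the proposed method rely on a proper voiced/unvoiced detector, and the frame length, threshold, segment size, and STFT parameters below are assumptions for demonstration.

```python
import numpy as np

def detect_transitions(signal, fs, frame_ms=40, energy_thr=0.01):
    """Label frames voiced/unvoiced by short-time energy (a stand-in for a
    real voicing detector) and return uv-v onsets and v-uv offsets, in samples."""
    frame = int(fs * frame_ms / 1000)
    n = len(signal) // frame
    energy = np.array([np.mean(signal[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n)])
    voiced = energy > energy_thr
    onsets, offsets = [], []
    for i in range(1, n):
        if voiced[i] and not voiced[i - 1]:
            onsets.append(i * frame)          # unvoiced -> voiced (onset)
        elif voiced[i - 1] and not voiced[i]:
            offsets.append(i * frame)         # voiced -> unvoiced (offset)
    return onsets, offsets

def stft_magnitude(segment, n_fft=256, hop=64):
    """Magnitude STFT (Hann window) of a short segment around a transition;
    one such TFR would be fed to the CNN."""
    win = np.hanning(n_fft)
    frames = [segment[s:s + n_fft] * win
              for s in range(0, len(segment) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq bins, time frames)

# Synthetic example: 0.2 s of silence followed by 0.2 s of a 150 Hz tone.
fs = 16000
t = np.arange(int(0.2 * fs)) / fs
sig = np.concatenate([np.zeros(int(0.2 * fs)), 0.5 * np.sin(2 * np.pi * 150 * t)])
onsets, offsets = detect_transitions(sig, fs)
tfr = stft_magnitude(sig[onsets[0] - 800:onsets[0] + 800])  # 100 ms window around the onset
```

The 100 ms analysis window centered on the transition is likewise an illustrative choice; the paper's actual segment length around each onset/offset is not stated in this excerpt.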
Accuracies of up to 89% are obtained for the classification of PD patients vs. HC speakers. This study is a step towards the robust modeling of speech impairments in patients with neurodegenerative disorders.

2. Methods

The proposed method is divided into three stages: (1) the detection of the onset and offset transitions, (2) the computation

Copyright 2017 ISCA. INTERSPEECH 2017, August 20-24, 2017, Stockholm, Sweden. http://dx.doi.org/10.21437/Interspeech.2017-1078
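For concreteness, the basic operations a CNN applies to a TFR patch (2D convolution, ReLU activation, max pooling) can be sketched in plain numpy. The actual architecture, kernel sizes, and number of filters of the proposed model are described later in the paper and are not reproduced here; the shapes below are placeholders.

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2D cross-correlation of a TFR patch x with kernel k
    (what a single CNN filter computes, ignoring bias)."""
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

def relu(x):
    """Rectified linear activation applied element-wise."""
    return np.maximum(x, 0.0)

def max_pool(x, p=2):
    """Non-overlapping p x p max pooling (trailing rows/cols are dropped)."""
    H, W = x.shape[0] // p * p, x.shape[1] // p * p
    return x[:H, :W].reshape(H // p, p, W // p, p).max(axis=(1, 3))

# Toy TFR patch and kernel, placeholder sizes only:
patch = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))
feature_map = max_pool(relu(conv2d_valid(patch, kernel)), p=1)
```

Stacking several such conv-ReLU-pool layers is what lets the network learn the high-level representations of the onset/offset TFRs mentioned in the introduction.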