PHONOLOGICAL FEATURES IN DISCRIMINATIVE CLASSIFICATION OF DYSARTHRIC SPEECH

Frank Rudzicz
Department of Computer Science, University of Toronto
frank@ai.toronto.edu

ABSTRACT

In an attempt to overcome problems associated with articulatory limitations and generative models, this work considers the use of phonological features in discriminative models for disabled speech. Specifically, we train feed-forward and recurrent neural networks, and radial basis and sequence-kernel support vector machines, to classify abstractions of the vocal tract, and apply these models to phone recognition on dysarthric speech. The results show relative error reductions of between 1.5% and 10.9% with this approach over standard hidden Markov modeling, and accuracy that increases with speaker intelligibility across all classifiers. This work may be applied within components of assistive software for speakers with dysarthria.

Index Terms—dysarthria, neural networks, kernel methods

1. INTRODUCTION

Dysarthria comprises a group of neuromuscular disorders that can drastically limit speech intelligibility in congenital cases such as cerebral palsy or traumatic ones such as stroke. These disorders typically also limit motor function more broadly, making other physical interaction (e.g., with a keyboard) slower and less desirable than spoken expression [1]. Unfortunately, automatic speech recognition is currently ill-suited to dysarthric speech, rendering such software inaccessible to those who might benefit from it most. We have found that traditional generative approaches such as hidden Markov models (HMMs) trained for speaker independence may achieve word-level accuracy below 4.5% on severely dysarthric speech, against 84.8% on non-disabled speech, on short sentences [2].

Disabled speech is typically characterized by a limited range of motion in the speech articulators, which results in smaller vowel spaces and more inconsistent consonants, especially in clusters [3]. As these phones assimilate with one another, generative models assign more probability to the overlapping regions of acoustic space, hurting performance. In this paper we consider two discriminative families for stochastic classification, neural networks (NNs) and support vector machines (SVMs), on the task of differentiating phones at the frame level in disabled speech. Since this speech is characterized by differences in physical production, our goal is to determine whether abstract representations of dysarthric articulation are easily discriminable in disordered speech, and whether these are useful in speech recognition for this population generally.

1.1. Phonological features

Phonological features (PFs), often called articulatory features, are quantized abstractions of speech production along particular dimensions of vocal tract configuration. For example, the Front/Back feature specifies the sagittal position of the tongue during vowels, and Static specifies the rate of acoustic change (e.g., diphthongs are dynamic). Because PFs can change asynchronously across phonetic boundaries and are more fine-grained than phonemic representations, their use has been shown to partially account for coarticulation effects and speaker variability [4], both of which are particularly exacerbated in dysarthric speech. Other useful properties of PFs include noise-robustness, language-independence, and reliable recovery from acoustics among typical speakers [5].
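To make the representation concrete, consider the Python sketch below. It is illustrative only, not part of the paper: the value inventories follow Table 1 below, but the two frame encodings are hand-picked examples of how PFs can change asynchronously near a phone boundary.

    # Illustrative sketch only: value inventories follow Table 1, but the
    # frame encodings are hand-picked examples, not the paper's lookup table.
    PF_VALUES = {
        "Manner":     ["approximant", "fricative", "nasal", "retroflex",
                       "silence", "stop", "vowel"],
        "Voice":      ["voiced", "unvoiced"],
        "Front/Back": ["front", "central", "back", "nil"],
        "Static":     ["static", "dynamic"],
    }

    # PFs can change asynchronously across a phone boundary: a frame late in
    # the [m] of "me" may already be front (anticipating the vowel [i]) while
    # its Manner is still nasal -- a coarticulation effect that a single
    # phone label cannot express.
    frame_mid_m  = {"Manner": "nasal", "Voice": "voiced",
                    "Front/Back": "nil",   "Static": "static"}
    frame_late_m = {"Manner": "nasal", "Voice": "voiced",
                    "Front/Back": "front", "Static": "static"}

    assert all(frame_late_m[f] in PF_VALUES[f] for f in PF_VALUES)

Because each frame is described by several small, partly independent inventories rather than by one label from a large phone set, a classifier can share evidence across phones that agree on a given feature.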
The features used here are based on those of Wester [6] and are listed in Table 1.

Feature      Values (with cardinality)
Manner       approximant, fricative, nasal, retroflex, silence, stop, vowel (7)
Place        alveolar, bilabial, dental, labiodental, silence, velar, nil (7)
High/Low     high, mid, low, silence, nil (5)
Voice        voiced, unvoiced (2)
Front/Back   front, central, back, nil (4)
Round        round, non-round, nil (3)
Static       static, dynamic (2)

Table 1. Phonological features and their possible values.

2. PHONOLOGICAL-ACOUSTIC MODELS

In this paper, acoustic observation vectors are frames of speech, optionally surrounded by a context window of varying length. Each PF is modeled by two NNs and two SVMs for each speaker, as described below. Additionally, for each of these four discriminative techniques, we construct three triphone classifiers: the first identifies triphones from acoustics alone, the second from the outputs of the seven PF classifiers alone, and the third from a combination of the two. Triphones absent from the training data are modeled by their monophone progenitors, of which there are 61.

2.1. Neural Networks

Multilayer neural networks have rarely been applied to the classification of dysarthric speech, despite their general popularity. One study, however, showed that multilayer feed-forward NNs supplied with either Fourier spectral coefficients or formant frequencies could achieve a relative error reduction (RER) of up to 40% over a commercial HMM-based system for a speaker with cerebral palsy [7]. The two types of neural network we consider here are the feed-forward multi-layer perceptron (MLP) and the recurrent Elman network (ELM), which are primarily distinguished by the latter's time-delayed replication of the hidden layer as additional contextual input.
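To illustrate the distinction, the Python (PyTorch) sketch below pairs a frame-level MLP with an Elman-style recurrent classifier for the seven-valued Manner feature. It is a minimal sketch under stated assumptions, not the paper's implementation: the 42-dimensional acoustic frame, the hidden-layer width, and the use of PyTorch are all hypothetical.

    # Illustrative sketch only -- input dimensionality, layer sizes, and the
    # use of PyTorch are assumptions, not the paper's configuration.
    import torch
    import torch.nn as nn

    N_ACOUSTIC = 42  # hypothetical per-frame acoustic dimensionality
    N_HIDDEN   = 64  # hypothetical hidden-layer width
    N_MANNER   = 7   # cardinality of the Manner feature (Table 1)

    # Feed-forward MLP: classifies each frame independently of its neighbours.
    mlp = nn.Sequential(
        nn.Linear(N_ACOUSTIC, N_HIDDEN),
        nn.Tanh(),
        nn.Linear(N_HIDDEN, N_MANNER),
    )

    # Elman network: nn.RNN is a simple recurrent layer whose hidden state at
    # frame t-1 is fed back alongside the acoustics at frame t, i.e., a
    # time-delayed copy of the hidden layer serves as additional context.
    class ElmanClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.rnn = nn.RNN(N_ACOUSTIC, N_HIDDEN, nonlinearity="tanh",
                              batch_first=True)
            self.out = nn.Linear(N_HIDDEN, N_MANNER)

        def forward(self, frames):          # frames: (batch, time, N_ACOUSTIC)
            hidden, _ = self.rnn(frames)    # hidden: (batch, time, N_HIDDEN)
            return self.out(hidden)         # per-frame Manner logits

    frames = torch.randn(8, 100, N_ACOUSTIC)  # a batch of 100-frame utterances
    logits = ElmanClassifier()(frames)        # shape: (8, 100, N_MANNER)

Training either model with a cross-entropy loss over the seven Manner classes would yield one of the per-feature frame classifiers of Section 2; one MLP and one Elman network would be trained per feature and per speaker.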