Automatic detection of voice impairments from text-dependent running speech

J.I. Godino-Llorente a,*, Rubén Fraile a, N. Sáenz-Lechón a, V. Osma-Ruiz a, P. Gómez-Vilda b

a Department of Circuits and Systems Engineering, Universidad Politécnica de Madrid, Ctra Valencia Km 7, 28031 Madrid, Spain
b Department of Computer Science and Engineering, Universidad Politécnica de Madrid, Spain

1. Introduction

It is well known that most voice disorders are characterized by an increase of mass, a lack of closure, and/or a change in the elasticity of the vocal folds. As a result, the movement of the vocal folds is not well balanced, and an incomplete closure of the vocal folds may appear in some or all glottal cycles. The whole harmonic structure is thus modified (increasing the inter-harmonic energy and the fundamental frequency perturbation), and the energy of the high-frequency components is increased due to the turbulence caused by an incomplete closure of the glottal cleft. This is reflected in the speech signal especially during voiced sounds, since in these segments the vocal folds are in movement.

With the aim of measuring the perturbations that appear in the presence of pathology, current acoustic analysis tools allow us to calculate a large number of long-term averaged acoustic parameters. Such parameters (jitter, shimmer, Harmonics to Noise Ratio (HNR), Normalized Noise Energy (NNE), Voice Turbulence Index (VTI), Glottal to Noise Excitation Ratio (GNE), Signal to Noise Ratio (SNR), Frequency Amplitude Tremor (FATR), and many others [1–6]) were developed to measure the quality and "degree of normality" of voice registers from the sustained phonation of vowels. Using this acoustic material, previous studies [7–9] indicate that the detection of voice alterations can be carried out by means of the above-mentioned long-term averaged acoustic parameters, enabling each individual voice utterance to be quantified by a single vector.
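To make the nature of these long-term perturbation parameters concrete, the two most common ones, jitter and shimmer, can be sketched as below. This is a minimal illustration, assuming the pitch periods and cycle peak amplitudes have already been extracted from a sustained vowel (that extraction is itself the difficult step for pathological voices); the function names and the local (cycle-to-cycle) variants shown are illustrative choices, not the exact definitions used in [1–6].

```python
import numpy as np

def jitter_percent(periods):
    """Local jitter: mean absolute difference between consecutive
    pitch periods, relative to the mean period, in percent."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def shimmer_percent(amplitudes):
    """Local shimmer: the analogous cycle-to-cycle perturbation
    of the peak amplitude of each glottal cycle, in percent."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Example: four pitch periods (seconds) and peak amplitudes from one vowel.
j = jitter_percent([0.0100, 0.0101, 0.0099, 0.0100])
s = shimmer_percent([1.0, 0.9, 1.1, 1.0])
```

Averaging such cycle-level perturbations over a whole sustained vowel yields one number per parameter, which is how each utterance ends up quantified by a single vector.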
The main drawback of most of these parameters is that they rely on an accurate estimation of the fundamental frequency, a rather complicated task in the presence of certain pathologies. There are also works in the literature that use short-time features for the detection of voice impairments. Some of them address this task from the excitation waveform collected with a laryngograph [10] or extracted from the acoustic data by inverse filtering methods [11]. Again, since inverse filtering is usually based on the assumption of a linear model, such methods do not behave well when pathology is present, due to the non-linearities introduced by the pathology itself. Other authors have also proposed non-linear processing for the same task [12]. However, the pathology that dominates spoken voice quality during conversation may have little effect on the quality of a vowel sustained at a comfortable pitch. Unfortunately, the parameters (and methods) previously enumerated cannot be easily applied to connected speech due to coarticulation, onset and offset effects, and suprasegmental variations. By contrast, there are few studies addressing the detection of voice impairments from running speech samples. In [13] the author extended the HNR concept to continuous speech signals,

Biomedical Signal Processing and Control 4 (2009) 176–182

ARTICLE INFO

Article history:
Received 9 June 2008
Received in revised form 28 January 2009
Accepted 29 January 2009
Available online 9 March 2009

Keywords: Running speech; Pathological voices; Mel cepstral parameters; Noise parameters; Voiced detection; Multilayer perceptron

ABSTRACT

Acoustic analysis is a useful tool to diagnose voice diseases. Furthermore, it presents several advantages: it is non-invasive, provides an objective diagnosis and can also be used for the evaluation of surgical and pharmacological treatments and of rehabilitation processes.
Most of the approaches found in the literature address the automatic detection of voice impairments from speech by using the sustained phonation of vowels. In this paper, a new scheme is proposed for the detection of voice impairments from text-dependent running speech. The proposed methodology is based on the segmentation of speech into voiced and non-voiced frames, parameterising each voiced frame with mel-frequency cepstral parameters. The classification is carried out using a discriminative approach based on a multilayer perceptron neural network. The data used to train the system were taken from the voice disorders database distributed by Kay Elemetrics. The material used for training and testing contains the running speech corresponding to the well-known "rainbow passage" of 140 patients (23 normal and 117 pathological). The results obtained are compared with those using sustained vowels. The text-dependent running speech showed a slight improvement in the accuracy of the detection.

© 2009 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +34 91 3367829; fax: +34 91 3367829.
E-mail address: igodino@ics.upm.es (J.I. Godino-Llorente).

doi:10.1016/j.bspc.2009.01.007
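The feature-extraction step described in the abstract (mel-frequency cepstral parameters computed per voiced frame) follows the standard MFCC pipeline, which can be sketched as below. This is only an illustration of the technique under assumed settings: the filterbank size (20 bands), the number of coefficients (12), the frame length, and the function names are all illustrative choices, not the exact configuration used in the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=20, n_ceps=12):
    """MFCCs of one (voiced) frame: windowed power spectrum ->
    triangular mel filterbank -> log energies -> DCT-II."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    # Filter centre frequencies equally spaced on the mel scale.
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0),
                                    n_filters + 2))
    fbank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        left, centre, right = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        rising = (freqs - left) / (centre - left)
        falling = (right - freqs) / (right - centre)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    log_e = np.log(fbank @ spec + 1e-10)
    # DCT-II decorrelates the log filterbank energies (c_1 .. c_n_ceps).
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1),
                                  (2 * n + 1) / (2.0 * n_filters)))
    return dct @ log_e

# Example: one 32 ms frame of a 200 Hz tone at fs = 16 kHz.
fs = 16000
t = np.arange(512) / fs
coeffs = mfcc(np.sin(2 * np.pi * 200.0 * t), fs)
```

Each voiced frame yields one such coefficient vector; the sequence of vectors from an utterance is then fed to the multilayer perceptron classifier described in the abstract.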