Combining Acoustic and Visual Modalities in Vowel Recognition System for Laryngectomees

Rafal Pietruch and Antoni Grzanka

978-1-4244-8820-9/10/$26.00 ©2010 IEEE

Abstract—This paper addresses the problem of vowel recognition in patients after total laryngectomy using combined visual and acoustic features. Linear prediction coefficients were estimated from the speech signal using a weighted recursive least squares algorithm, and ten cross-sectional areas of a vocal tract model were calculated from them. Facial expression parameters related to the spoken vowel were extracted from video recordings: lip width, lip height and jaw opening were measured from grabbed video frames. Principal component analysis was applied to show the correlations between the auditory and visual features. The vowel recognition procedures were based on single-hidden-layer neural networks, and the recognition performance of the visual, acoustic and fused modalities was compared. The results show that recognition performance for sustained vowels using the 10 cross-sectional area estimates is very low; facial expression analysis is needed when standard acoustic parameters of pathological speech cannot be reliably estimated.

I. INTRODUCTION

A. Laryngectomees' Speech

Laryngectomy is a partial or complete removal of the larynx, usually performed as a treatment for laryngeal carcinoma. Following the loss of the vocal cords, patients are not able to phonate adequately: their voice is hoarse, weak and strained. The main problem is pronouncing voiced sounds, which are naturally articulated with the use of the vocal cords. The main goal of phoniatric rehabilitation is therefore to learn how to articulate alternative sounds. In esophageal speech an alaryngeal voice is produced using the pharyngo-esophageal segment. However, a certain percentage of laryngectomees never acquire an alternative voice; they communicate with silently articulated words called pseudo-whisper.

B. Speech Signal Enhancement

Several issues make evaluation of the acoustical descriptors of laryngectomees' speech difficult.
Noise from the tracheostoma plays a significant role in masking the speech spectrum [1]. The effective length of the vocal tract after laryngectomy is shortened, and the formant frequencies are shifted to higher frequencies [1]–[4]. The main goal of enhancement research is to make the speech more intelligible for listeners. In [5] the authors developed a special-purpose DSP hardware unit that replaced the voicing source of esophageal speech using a formant analysis-synthesis approach. The authors of [6] used a digital signal processing technique based on modeling the radiated pulses in the frequency domain.

* This publication is co-financed by the European Union from the European Regional Development Fund under the Operational Program Innovative Economy, 2007-2013. Rafal Pietruch is with the Industrial Research Institute for Automation and Measurements, Al. Jerozolimskie 202, 02-486 Warsaw (telephone: (+4822) 874-03-66, fax: (+4822) 874-02-20, e-mail: rpietruch@piap.pl). Antoni Grzanka is with the Institute of Electronic Systems, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw (e-mail: a.grzanka@ise.pw.edu.pl).

C. Multimodal Speech Recognition

It is hard to perform automatic recognition of sustained vowels in pseudo-whisper speech using standard acoustic features (the F1 and F2 formants) [1]. The authors concluded that supplementary information is essential for vowel recognition in pathological speech. Several multimodal approaches have been used for laryngectomees' speech recognition. An example application with magnets placed on the lips, teeth and tongue was presented in [7]: a miniature device with magnetic sensors recognizes the speech using a vocabulary database prepared for each patient. Apart from the variety of esophageal speech analysis and enhancement approaches, considerable attention has been paid to methods of fusing the acoustical and visual modalities.
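As a concrete illustration of feature-level (early) fusion, the sketch below concatenates normalised acoustic and visual feature vectors and classifies them with a single-hidden-layer network of the kind mentioned in the abstract. All numbers are synthetic stand-ins invented for illustration: the formant-like acoustic values, the lip/jaw measurements, the network sizes and the training settings are assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins (assumptions, not the paper's data): formant-like
# acoustic features in Hz and lip/jaw measurements in mm for two vowels.
n = 100
ac = np.vstack([rng.normal([700, 1200], 50.0, (n, 2)),    # "a"-like vowel
                rng.normal([300, 2300], 50.0, (n, 2))])   # "i"-like vowel
vi = np.vstack([rng.normal([45, 20, 12], 2.0, (n, 3)),
                rng.normal([30, 8, 5], 2.0, (n, 3))])
y = np.r_[np.zeros(n), np.ones(n)]

def zscore(x):
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Early fusion: normalise each modality separately, then concatenate.
X = np.hstack([zscore(ac), zscore(vi)])

# Single-hidden-layer network trained by full-batch gradient descent
# on the cross-entropy loss.
W1 = rng.normal(0.0, 0.5, (5, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.5
for _ in range(2000):
    h = np.tanh(X @ W1 + b1)                      # hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # sigmoid output
    g = (p - y[:, None]) / len(y)                 # dLoss/dlogit
    gh = (g @ W2.T) * (1.0 - h ** 2)              # backprop through tanh
    W2 -= lr * (h.T @ g); b2 -= lr * g.sum(axis=0)
    W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)

accuracy = float(((p[:, 0] > 0.5) == y).mean())
print(f"fused-modality training accuracy: {accuracy:.2f}")
```

Normalising each modality before concatenation keeps the higher-magnitude acoustic features from dominating the visual ones; a decision-level (late) fusion variant would instead train one network per modality and combine their outputs.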
Although visual signals provide only a partial, ambiguous description of the vocal tract, this information is complementary to the standard acoustic features. Most applications combining the visual and auditory modalities target speech in noisy environments [8]–[13]. The pathological speech of laryngectomees can be an even more complex problem than a noisy environment. In earlier research the authors demonstrated that visual features are promising candidates for pathological speech recognition [14], [15]. The visual modality is an important factor in pseudo-whisper voice enhancement.

D. Aims

The authors developed a computerized measurement station to acquire speech parameters of laryngectomees [1]. A computer program was developed to evaluate the progress of the patient's rehabilitation process. The aim of the presented study was to find correlations between video and audio features.

II. METHODS

A. Subjects

Twelve males who underwent total laryngectomy and 12 volunteers (7 males) participated in the research. As the material we used pronunciations of 6 sustained Polish vowels: 'a', 'i', 'e', 'y', 'o', 'u' (see Appendix A for the related IPA symbols).
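The acoustic features named in the abstract (linear prediction coefficients and ten cross-sectional areas of a lossless-tube vocal tract model) can be sketched as follows. The paper estimates the LPC with a weighted recursive least squares algorithm; as a simpler stand-in, this sketch uses the batch autocorrelation (Levinson-Durbin) method, whose reflection coefficients directly yield a Wakita-style area function. The AR(2) test signal, the model order and the unit glottal area are illustrative assumptions, and the sign convention of the area recursion varies between texts.

```python
import numpy as np

def levinson_durbin(r, order):
    """Autocorrelation r -> predictor coefficients a and reflection
    coefficients k via the Levinson-Durbin recursion."""
    a = np.zeros(order + 1); a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k[i-1] = -acc / err
        a[1:i] = a[1:i] + k[i-1] * a[i-1:0:-1]   # update lower-order terms
        a[i] = k[i-1]
        err *= 1.0 - k[i-1] ** 2                 # prediction error shrinks
    return a, k

def tube_areas(k, glottis_area=1.0):
    """Cross-sectional areas of a concatenated lossless-tube model derived
    from reflection coefficients (the unit glottal area and the sign
    convention are assumptions; some texts use the opposite sign)."""
    areas = [glottis_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 + ki) / (1.0 - ki))
    return np.array(areas)

# Synthetic AR(2) stand-in for one frame of speech:
# x[n] = 0.75*x[n-1] - 0.5*x[n-2] + e[n]
rng = np.random.default_rng(0)
e = rng.normal(size=20000)
x = np.zeros_like(e)
for n in range(2, len(x)):
    x[n] = 0.75 * x[n-1] - 0.5 * x[n-2] + e[n]

order = 10  # ten reflection coefficients -> ten tube sections
r = np.array([x[: len(x) - lag] @ x[lag:] for lag in range(order + 1)]) / len(x)
a, k = levinson_durbin(r, order)
areas = tube_areas(k)
print("a[1], a[2] =", a[1], a[2])   # expected near -0.75 and 0.5
```

The recovered predictor coefficients should approximate the known AR(2) generator, and since the autocorrelation of a real signal keeps every reflection coefficient inside (-1, 1), all ten areas come out positive. Only the area ratios are meaningful here; the absolute scale is fixed by the arbitrary glottal area.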