Combining Acoustic and Visual Modalities in Vowel Recognition System for Laryngectomees

Rafal Pietruch and Antoni Grzanka

978-1-4244-8820-9/10/$26.00 ©2010 IEEE
Abstract—This paper addresses the problem of vowel recognition in patients after total laryngectomy using combined visual and acoustic features. Linear prediction coefficients were estimated from the speech signal using a weighted recursive least squares algorithm, and ten cross-sectional areas of a vocal tract model were calculated. Facial expression parameters related to the spoken vowel (lips width, lips height and jaw opening) were measured from grabbed video frames. Principal component analysis was applied to expose correlations between the auditory and visual features. The vowel recognition procedures were based on single-hidden-layer neural networks, and the recognition performance of the visual, acoustic and fused modalities was compared. The results show that recognition of sustained vowels using the ten cross-sectional area estimates alone is very poor; facial expression analysis is therefore needed when the standard acoustic parameters of pathological speech cannot be estimated reliably.
I. INTRODUCTION
A. Laryngectomees’ Speech
LARYNGECTOMY is the partial or complete removal of the larynx, usually performed as a treatment for laryngeal carcinoma. Following the loss of the vocal cords, patients are not able to phonate adequately: their voice is hoarse, weak and strained. The main difficulty is pronouncing voiced sounds, which are naturally articulated with the use of the vocal cords. The main goal of phoniatric rehabilitation is to learn to articulate alternative sounds. In esophageal speech, an alaryngeal voice is produced using the pharyngo-esophageal segment. However, a certain percentage of laryngectomees never acquire an alternative voice; they communicate with silently articulated words, called pseudo-whisper.
B. Speech Signal Enhancement
Several issues make evaluation of the acoustic descriptors of laryngectomees' speech difficult. Noise from the tracheostoma plays a significant role in masking the speech spectrum [1]. The effective length of the vocal tract after laryngectomy is shortened, and the formant frequencies are shifted upward [1]–[4]. The main goal of research in this area is to enhance the speech and make it more intelligible to listeners. In [5] the authors developed a special-purpose DSP hardware unit that replaced the voicing sources of esophageal speech using a formant analysis-synthesis approach. The authors of [6] used a digital signal processing technique based on modeling the radiated pulses in the frequency domain.

This publication is co-financed by the European Union from the European Regional Development Fund under the Operational Program Innovative Economy, 2007-2013.
Rafal Pietruch is with the Industrial Research Institute for Automation and Measurements, Al. Jerozolimskie 202, 02-486 Warsaw (telephone: (+4822) 874–03–66, fax: (+4822) 874–02–20, e-mail: rpietruch@piap.pl).
Antoni Grzanka is with the Institute of Electronic Systems, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw (e-mail: a.grzanka@ise.pw.edu.pl).
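The enhancement approaches above, like the analysis in this paper, rest on an all-pole (LPC) model of the vocal tract. The following stdlib-only sketch is illustrative, not the authors' implementation: it uses the standard Levinson-Durbin recursion on the autocorrelation sequence rather than the weighted recursive least squares estimator the paper employs, and the sign convention relating reflection coefficients to tube areas is an assumption.

```python
# Illustrative LPC sketch (Levinson-Durbin, not the paper's WRLS estimator).
# In the lossless-tube interpretation, the reflection coefficients k_i relate
# successive cross-sectional areas; here we assume A[i+1]/A[i] = (1+k)/(1-k),
# one common sign convention.

def autocorr(signal, max_lag):
    """Biased autocorrelation estimates r[0..max_lag]."""
    n = len(signal)
    return [sum(signal[t] * signal[t + lag] for t in range(n - lag)) / n
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (lpc_coeffs, reflection_coeffs) of an AR model of given order.

    Assumes a stable model, i.e. all |k_i| < 1.
    """
    a = [1.0] + [0.0] * order   # prediction polynomial, a[0] = 1
    err = r[0]                  # prediction error power
    ks = []
    for i in range(1, order + 1):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err
        ks.append(k)
        a_new = a[:]            # symmetric coefficient update
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= (1.0 - k * k)
    return a, ks

def tube_areas(ks, lips_area=1.0):
    """Cross-sectional areas of the lossless-tube model, starting at the lips."""
    areas = [lips_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 + k) / (1.0 - k))
    return areas

# With the exact autocorrelation of an AR(1) process, r[lag] = 0.5**lag,
# the recursion recovers a = [1.0, -0.5, 0.0]: the second k vanishes.
a, ks = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

With ten reflection coefficients, `tube_areas` yields the ten-section area function of the kind the abstract refers to; the absolute scale is arbitrary, only the ratios are determined by the signal.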
C. Multimodal Speech Recognition
It is hard to perform automatic recognition of sustained vowels in pseudo-whisper speech using standard acoustic features (the F1 and F2 formants) [1]. The authors concluded that supplementary information is essential for vowel recognition in pathological speech. Several multimodal approaches have been applied to laryngectomees' speech recognition. An example application with magnets placed on the lips, teeth and tongue was presented in [7]: a miniature device with magnetic sensors recognizes speech against a vocabulary database prepared for each patient.
Apart from the variety of esophageal speech analysis and enhancement approaches, considerable attention has been paid to methods of fusing the acoustic and visual modalities. Although visual signals provide only a partial, ambiguous description of the vocal tract, this information is complementary to the standard acoustic features. Most applications combining the visual and auditory modalities target noisy environments [8]–[13]; the pathological speech of laryngectomees can be an even more complex problem than a noisy environment. In earlier research the authors demonstrated that visual features are promising candidates for pathological speech recognition [14], [15]. The visual modality is an important factor in pseudo-whisper voice enhancement.
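The fusion strategy used later in the paper is feature-level: acoustic and visual measurements are concatenated and fed to a single-hidden-layer neural network. The sketch below illustrates only the forward pass of such a network; the feature layout, network size and weights are hypothetical placeholders, not the paper's configuration.

```python
import math

VOWELS = ["a", "i", "e", "y", "o", "u"]  # the six Polish vowels studied

def forward(features, w_hidden, b_hidden, w_out, b_out):
    """Single-hidden-layer network: tanh hidden units, softmax output.

    `features` is a concatenated acoustic + visual vector (feature-level fusion).
    """
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    m = max(logits)                         # stabilized softmax
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical fused vector: e.g. two normalized acoustic measurements
# followed by lips width, lips height and jaw opening from a video frame.
fused = [0.2, 0.7, 0.5, 0.3, 0.4]

# With untrained all-zero weights every vowel gets equal probability 1/6.
probs = forward(fused, [[0.0] * 5] * 3, [0.0] * 3, [[0.0] * 3] * 6, [0.0] * 6)
```

In practice the weights would be trained (e.g. by backpropagation) on labeled recordings from each modality separately and from the concatenated features, allowing the per-modality and fused performances to be compared as the abstract describes.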
D. Aims
The authors developed a computerized measurement station to acquire speech parameters of laryngectomees [1]. A computer program was developed to evaluate the progress of the patient's rehabilitation process. The aim of the presented study was to find correlations between the video and audio features.
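Principal component analysis, as used in this study, exposes such correlations by finding the directions of greatest joint variance in the pooled audio-visual feature vectors. The following stdlib-only sketch shows the basic machinery (sample covariance plus power iteration for the dominant eigenvector); it is a minimal illustration, not the authors' analysis pipeline.

```python
# Minimal PCA sketch: covariance matrix of pooled audio-visual feature
# vectors, then the first principal component via power iteration.

def covariance_matrix(samples):
    """Sample covariance; rows are observations, columns are features."""
    n = len(samples)
    dim = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(dim)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j])
                 for row in samples) / (n - 1)
             for j in range(dim)] for i in range(dim)]

def first_principal_component(cov, iterations=200):
    """Dominant eigenvector of the covariance matrix via power iteration."""
    dim = len(cov)
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

If an acoustic feature and a visual feature co-vary across vowels, both receive large loadings (of comparable magnitude) in the leading component, which is exactly the kind of audio-visual correlation the study looks for.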
II. METHODS
A. Subjects
Twelve males who underwent total laryngectomy and 12 volunteers (7 males) participated in the research. As material we used pronunciations of the 6 sustained Polish vowels 'a', 'i', 'e', 'y', 'o', 'u' (see Appendix A for the related IPA symbols).