Broad Phoneme Class Recognition in Noisy Environments Using the GEMS Device
Cenk Demiroglu and David V. Anderson
Department of Electrical and Computer Engineering
Georgia Institute of Technology, USA
demirogc,dva@ece.gatech.edu
Abstract
Broad phoneme class recognition offers additional acoustic-phonetic
knowledge to speech processing applications. Several papers have
shown that exploiting such information is advantageous for HMM-based
speech enhancement systems. The problem with those systems is the
dramatic decrease in recognition accuracy in noisy environments. In
this work, we extract the energy feature from an auxiliary sensor and
directly fuse it with the features extracted from the speech signal.
Experimental results with noisy speech show a significant increase in
performance.
1. Introduction
Recent advances in sensor technology have had a significant impact on
many engineering problems. Speech processing is one of the fields
that can benefit greatly from this technology. In this work, the
general electromagnetic sensor (GEMS) device is used to address the
problem of noise-robust broad phoneme class recognition.
Broad phoneme class recognition can be useful in various speech
applications. For example, HMM-based speech enhancement [1] can use
the acoustic-phonetic knowledge of the phoneme classes once those
classes are recognized. Unfortunately, recognition accuracy in noisy
environments is typically too low to exploit the full potential of
those systems. Given the importance of solving the noise robustness
problem of such systems, we attack this old problem with a new tool:
auxiliary sensors.
Auxiliary sensors have been used, especially recently, in the speech
field. In [2], [3], several speech enhancement algorithms are
described that use auxiliary sensors. In [4], a bone-conduction
microphone is used to enhance the speech signal. In [5], a throat
microphone is used to enhance the noisy features extracted from the
acoustic microphone by probabilistic optimum filtering. Our approach
is novel in the sense that we do not use the auxiliary sensor for
speech enhancement, but we directly extract features from it and fuse
those features with the acoustic microphone's features.

The GEMS speech coding work is sponsored by the Defense Advanced
Research Projects Agency under Contract N00024-02-C-6339, and this
paper has been designated "Approved for public release, distribution
unlimited." Opinions, interpretations, conclusions, and
recommendations are those of the authors and are not necessarily
endorsed by the US Government.

In this work, we used the general electromagnetic sensor (GEMS)
device, which can provide information on glottal airflow velocity
when directed at the throat [6]. Its key property is that it is
relatively robust to ambient noise. Thus, it opens exciting
opportunities for various noise-robust speech processing
applications. There are systems that extract the voicing information
from the GEMS sensor and fuse that feature with the acoustic feature
vector [7]. Those systems extract a single feature from the sensor,
and they cannot fully exploit the information in the signal.
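The fusion idea described above, extracting an energy feature from the sensor signal and appending it to the acoustic feature vector, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `frame_log_energy` and `fuse_features`, the use of log-energy, the frame length, the hop size, and the 8 kHz sampling rate are all assumptions chosen for illustration, and the toy acoustic features are placeholders.

```python
import math


def frame_log_energy(signal, frame_len, hop):
    """Log-energy of each full frame of a 1-D signal (list of floats)."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame)
        energies.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return energies


def fuse_features(acoustic_frames, sensor_energy):
    """Append the sensor energy of frame t to the acoustic vector of frame t."""
    n = min(len(acoustic_frames), len(sensor_energy))
    return [list(acoustic_frames[t]) + [sensor_energy[t]] for t in range(n)]


# Toy sensor signal (assumed 8 kHz): one silent frame, then a 100 Hz tone.
fs = 8000
sensor = [0.0] * 160 + [math.sin(2 * math.pi * 100 * n / fs) for n in range(160)]
energy = frame_log_energy(sensor, frame_len=160, hop=160)

# Placeholder acoustic feature vectors, one per frame (hypothetical values).
acoustic = [[0.1, 0.2], [0.3, 0.4]]
fused = fuse_features(acoustic, energy)
```

Each fused vector has one more dimension than the acoustic vector, and in this toy signal the sensor energy of the second frame is much larger than that of the silent first frame. Simple concatenation is used here only because it matches the "direct fusion" wording; the two streams are assumed to be frame-synchronous.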
This paper is organized as follows. A brief description
of the GEMS device is given in Section 2. The recognition
framework is discussed in Section 3. The corpus that is used
for experiments is described in Section 4. Experimental re-
sults are presented in Section 5, and the paper is concluded
in Section 6.
2. The Auxiliary Sensor
In this work, an additional sensor is used to provide noise-robust
information to the recognition system. Several sensors would work
well in this role, including throat accelerometers, physiological
microphones (p-mics), bone-conduction microphones, and
electromagnetic glottal or vibration sensors. All of these have a
low-pass characteristic, and most do not do a good job of reproducing
the vocal-tract modulation of the glottal spectrum. However, all of
them can be used to identify voicing and pitch/harmonics. For this
paper, we report on results generated using the GEMS device from
Aliph, Inc.
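To make the voicing and pitch cue mentioned above concrete, the following is a minimal sketch of a normalized-autocorrelation detector applied to one frame of a sensor-like signal. The paper does not say how voicing or pitch would be computed; the function name `autocorr_pitch`, the 60-400 Hz search range, the 0.5 voicing threshold, and the 8 kHz sampling rate are assumptions for illustration only.

```python
import math


def autocorr_pitch(frame, fs, f_min=60.0, f_max=400.0, voicing_threshold=0.5):
    """Return (is_voiced, f0_hz) from the normalized autocorrelation peak."""
    energy = sum(x * x for x in frame)
    if energy <= 0.0:
        return False, 0.0  # silent frame: no voicing decision possible
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(frame) - 1)
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation at this lag, normalized by the frame energy.
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag == 0 or best_r < voicing_threshold:
        return False, 0.0
    return True, fs / best_lag


# Toy frames (assumed 8 kHz): a 100 Hz tone and silence.
fs = 8000
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
silence = [0.0] * 400
voiced_tone, f0_tone = autocorr_pitch(tone, fs)
voiced_sil, f0_sil = autocorr_pitch(silence, fs)
```

For the toy tone, the strongest peak inside the search range falls at a lag of one period (80 samples), so the frame is marked voiced with a pitch estimate of 100 Hz, while the silent frame is marked unvoiced. A real sensor signal would likely need a tuned threshold and search range.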
The general electromagnetic sensor (GEMS) is a micropower device that
can be used, among other things, to detect motion in the region of
the glottis. The GEMS device consists of a penetrating radar whose
principles have been studied extensively both at the Lawrence
Livermore National Laboratory and at Aliph, Inc. Descriptions of its
properties can be found in [2].
When positioned correctly on the exterior of the throat
adjacent to the glottis, the output of the radar during voiced
speech is a signal that resembles an ideal excitation wave-
form. The GEMS device responds to vocal fold vibration at
the larynx. The signal obtained is robust to external acoustic
influences, such as noise, and it can be used for applications
1805 0-7803-8622-1/04/$20.00 ©2004 IEEE