Broad Phoneme Class Recognition in Noisy Environments Using the GEMS Device

Cenk Demiroglu and David V. Anderson
Department of Electrical and Computer Engineering
Georgia Institute of Technology, USA
demirogc,dva@ece.gatech.edu

Abstract

Broad phoneme class recognition has the advantage of offering additional acoustic-phonetic knowledge to speech processing applications. Several papers have shown that exploiting such information is advantageous for HMM-based speech enhancement systems. The problem with those systems is the dramatic decrease in recognition accuracy in noisy environments. In this work, we extract the energy feature from an auxiliary sensor and directly fuse it with the features extracted from the speech signal. Experimental results with noisy speech show a significant increase in performance.

1. Introduction

Recent advances in sensor technology have had a significant impact on many engineering problems. Speech processing is one of the fields that can benefit greatly from this technology. In this work, the general electromagnetic sensor (GEMS) device is used to address the problem of noise-robust broad phoneme class recognition.

Broad phoneme class recognition can be useful in various speech applications. For example, HMM-based speech enhancement [1] can use the acoustic-phonetic knowledge of the phoneme classes once those classes are recognized. Unfortunately, recognition accuracy in noisy environments is typically too low to exploit the full potential of those systems. Recognizing the importance of solving the noise robustness problem of such systems, we attack this old problem with a new tool: auxiliary sensors.

Auxiliary sensors have been used in the speech field, especially recently. In [2], [3], several speech enhancement algorithms that use auxiliary sensors are described. In [4], a bone-conduction microphone is used to enhance the speech signal.
In [5], a throat microphone is used to enhance the noisy features extracted from the acoustic microphone signal by probabilistic optimum filtering. Our approach is novel in the sense that we do not use the auxiliary sensor for speech enhancement; instead, we directly extract features from it and fuse those features with the acoustic microphone's features.

In this work, we use the general electromagnetic sensor (GEMS) device, which can provide information on glottal airflow velocity when directed at the throat [6]. Its key property is that it is relatively robust to ambient noise. Thus, it can open exciting opportunities for various noise-robust speech processing applications. There are systems that extract voicing information from the GEMS sensor and fuse that feature with the acoustic feature vector [7]. Those systems extract a single feature from the sensor and cannot fully exploit the information in the signal.

This paper is organized as follows. Section 2 gives a brief description of the GEMS device. The recognition framework is discussed in Section 3. The corpus used for the experiments is described in Section 4. Experimental results are presented in Section 5, and Section 6 concludes the paper.

The GEMS speech coding work is sponsored by the Defense Advanced Research Projects Agency under Contract N00024-02-C-6339, and this paper has been designated "Approved for public release, distribution unlimited." Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the US Government.

2. The Auxiliary Sensor

In this work, an additional sensor is used to provide noise-robust information to the recognition system. Several sensors would work well in this role, including throat accelerometers, physiological microphones (p-mics), bone-conduction microphones, and electromagnetic glottal or vibration sensors.
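
As a rough illustration of the direct fusion idea described in the Introduction, the following Python sketch computes a per-frame log-energy from an auxiliary-sensor signal and appends it to each acoustic feature vector. The frame length, hop size, 8 kHz sampling rate, log-energy definition, toy signals, and the function names frame_log_energy and fuse_features are all assumptions made for illustration; the text above does not specify them, and this is not necessarily how the authors compute or fuse the feature.

```python
import math


def frame_log_energy(signal, frame_len, hop):
    """Return the log-energy of each full frame of a 1-D signal.

    Assumption: energy is the sum of squared samples in the frame,
    with a small floor so that silent frames do not produce log(0).
    """
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame)
        energies.append(math.log(energy + 1e-10))
    return energies


def fuse_features(acoustic_frames, sensor_energies):
    """Append the sensor log-energy to each acoustic feature vector.

    Frames are paired by index; extra frames in the longer stream are dropped.
    """
    n = min(len(acoustic_frames), len(sensor_energies))
    return [list(acoustic_frames[i]) + [sensor_energies[i]] for i in range(n)]


# Toy auxiliary-sensor signal (hypothetical): one silent frame followed by
# one frame of a 100 Hz sinusoid, sampled at an assumed 8 kHz.
sensor = [0.0] * 160 + [math.sin(2 * math.pi * 100 * t / 8000) for t in range(160)]

# Placeholder acoustic feature vectors, one per frame (values are made up).
acoustic = [[0.1, 0.2], [0.3, 0.4]]

energies = frame_log_energy(sensor, frame_len=160, hop=160)
fused = fuse_features(acoustic, energies)
```

Each fused vector here has one more dimension than the acoustic vector, and the appended value is much larger for the frame containing the sinusoid than for the silent frame. This is the kind of contrast an energy feature from a noise-robust sensor could contribute to a recognizer.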
All of these have a low-pass characteristic, and most do not reproduce the vocal-tract modulation of the glottal spectrum well. However, all of them can be used to identify voicing and pitch/harmonics. For this paper, we report results generated using the GEMS device from Aliph, Inc.

The general electromagnetic sensor (GEMS) is a micropower device that can be used, among other things, to detect motion in the region of the glottis. The GEMS device consists of a penetrating radar whose principles have been studied extensively both at Lawrence Livermore National Laboratory and at Aliph, Inc. Descriptions of its properties can be found in [2]. When positioned correctly on the exterior of the throat adjacent to the glottis, the radar output during voiced speech is a signal that resembles an ideal excitation waveform. The GEMS device responds to vocal fold vibration at the larynx. The signal obtained is robust to external acoustic influences, such as noise, and it can be used for applications

0-7803-8622-1/04/$20.00 ©2004 IEEE