Non-Audible Murmur recognition based on fusion of audio and visual streams

Panikos Heracleous and Norihiro Hagita
ATR, Intelligent Robotics and Communication Laboratories,
2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto-fu 619-0288, Japan
{panikos,hagita}@atr.jp

This work was supported by the KAKENHI 21118003 project.

Abstract

Non-Audible Murmur (NAM) is an unvoiced speech signal that can be received through the body tissue with the use of special acoustic sensors (i.e., NAM microphones) attached behind the talker's ear. In a NAM microphone, body transmission and the loss of lip radiation act as a low-pass filter; consequently, the higher-frequency components of a NAM signal are attenuated. Owing to factors such as this spectral reduction, the unvoiced nature of NAM, and the type of articulation, NAM sounds become similar to one another, causing a larger number of confusions than in normal speech. In the present article, the visual information extracted from the talker's facial movements is fused with NAM speech using three fusion methods, and phoneme classification experiments are conducted. The experimental results reveal a significant improvement when NAM speech and facial information are fused.

1. Introduction

Non-Audible Murmur (NAM) refers to very softly uttered speech received through the body tissue. A special acoustic sensor (i.e., the NAM microphone) is attached behind the talker's ear and receives very soft sounds that are inaudible to listeners in close proximity to the talker. The attachment of the NAM microphone to the talker is shown in Figure 1.

[Figure 1: NAM microphone attached to the talker]

The first NAM microphone was based on the stethoscopes used by medical doctors to examine patients and was called the stethoscopic microphone [1]. Stethoscopic microphones were used for the automatic recognition of NAM speech [2]. The silicon NAM microphone is a more advanced version [3]: a highly sensitive microphone wrapped in silicon, a material chosen because its impedance is similar to that of human skin. Silicon NAM microphones have been employed for the automatic recognition of NAM speech as well as for NAM-to-speech conversion [6]. Similar approaches have been introduced for speech enhancement and speech recognition [4, 5]. Furthermore, non-audible speech recognition based on electromyography (EMG) has also been reported, in which the electric signals produced by the articulatory muscles are processed [7].

The speech received by a NAM microphone has different spectral characteristics from normal speech. In particular, NAM speech has limited high-frequency content because of body transmission: frequency components above the 3500-4000 Hz range are not present in NAM speech (a minimal simulation sketch is given at the end of this section). The NAM microphone can also be used to receive audible speech directly from the body (Body-Transmitted Ordinary Speech, BTOS). This enables automatic speech recognition in a conventional way while taking advantage of the robustness of NAM against noise.

Previous studies have reported very promising results for NAM speech recognition. A word accuracy of 93.9% was achieved on a 20k-word Japanese vocabulary dictation task when a small amount of training data from a single speaker was used [2]. Moreover, experiments were conducted using simulated and real noisy test data with clean training models to investigate the role of the Lombard reflex [8, 9] in NAM recognition.
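To illustrate the low-pass character of body transmission described above, the following is a minimal Python sketch. The Butterworth filter, its order, and the 3.5 kHz cutoff are illustrative assumptions chosen to match the attenuation range noted above; they are not the measured response of a NAM microphone.

```python
# Minimal sketch: approximate the low-pass effect of body transmission
# on a speech waveform. The Butterworth filter, its order, and the
# 3.5 kHz cutoff are assumptions, not the measured NAM-microphone response.
import numpy as np
from scipy.signal import butter, filtfilt

def simulate_body_lowpass(speech, sr, cutoff_hz=3500.0, order=5):
    """Attenuate components above the cutoff, mimicking body transmission."""
    b, a = butter(order, cutoff_hz / (sr / 2.0), btype="low")
    return filtfilt(b, a, speech)  # zero-phase filtering

# Example: filter one second of noise sampled at 16 kHz
sr = 16000
nam_like = simulate_body_lowpass(np.random.randn(sr), sr)
```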
The HMM distances of NAM sounds were also compared with those of normal speech; the comparison indicated reduced distances in the case of NAM sounds [10]. In the same study, preliminary results for NAM recognition using audio-visual data, based on concatenative feature fusion, were also reported.

In the present study, audio-visual NAM recognition is further investigated by using multistream HMM decision fusion and late fusion to integrate the audio and visual information (a minimal sketch of the stream-weighted likelihood combination is given at the end of Section 2.1). A statistical significance test was performed, and audio-visual NAM recognition in a noisy environment was also investigated.

2. Methodology

2.1. Corpus and HMM modeling

The corpus used in the experiments consisted of 212 continuous Japanese utterances containing 7518 phoneme realisations. A 3-state HMM topology with no skip transitions was used. Forty-three monophones were trained on 5132 phoneme instances, and 2386 phoneme instances were used for testing. The audio parameter vectors were of length 36 (12 MFCCs, 12 ΔMFCCs, and 12 ΔΔMFCCs). The HTK 3.4 toolkit was used for training and testing.
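To make the 36-dimensional parameterisation concrete, here is a minimal Python sketch. The use of librosa is an assumption for illustration only; the original experiments used HTK 3.4, whose analysis settings (window length, frame shift, filterbank) are not specified above and may differ.

```python
# Minimal sketch of the 36-dimensional audio parameterisation
# (12 MFCCs + 12 deltas + 12 delta-deltas). librosa is an assumption;
# the original experiments used HTK 3.4, whose analysis settings may differ.
import librosa
import numpy as np

def nam_features(wav_path, sr=16000):
    """Return a (num_frames, 36) feature matrix for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)  # (12, T)
    delta = librosa.feature.delta(mfcc)                 # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)       # second derivatives
    return np.vstack([mfcc, delta, delta2]).T           # (T, 36)
```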
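For the multistream HMM decision fusion mentioned above, the audio and visual streams are typically combined at the state level through weighted log-likelihoods. The sketch below shows this combination; the weight values, and the assumption that they sum to one, are illustrative rather than taken from the experiments.

```python
# Minimal sketch of stream-weighted log-likelihood combination for
# multistream HMM decision fusion. The weights are illustrative; in
# practice they are usually tuned on held-out data.
import numpy as np

def fused_log_likelihood(log_b_audio, log_b_visual,
                         w_audio=0.7, w_visual=0.3):
    """Combine per-state audio and visual emission log-likelihoods.

    log_b_audio, log_b_visual: arrays of shape (num_states,) holding the
    log-likelihoods of the current audio and visual observations under
    each HMM state. The weights act as stream reliability exponents.
    """
    return w_audio * np.asarray(log_b_audio) + w_visual * np.asarray(log_b_visual)
```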