A Study on an LP-based Model for Restoring Bone-conducted Speech

Thang tat Vu, Masashi Unoki, and Masato Akagi
School of Information Science, Japan Advanced Institute of Science and Technology
1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan
Email: {vu-thang, unoki, akagi}@jaist.ac.jp

Abstract. In a highly noisy environment, bone-conducted speech seems to be more advantageous than normal noisy speech because of its stability against surrounding noise. The sound quality of bone-conducted speech, however, is very low, and restoring bone-conducted speech is a challenging new topic in the speech signal processing field. In this paper, we propose a restoration model based on linear prediction (LP). To evaluate the ability of the LP-based model to improve voice quality, we compared it with existing models using one subjective and three objective measurements. The experiments showed that the LP-based model yields restored signals that are better both for human hearing and for the front-ends of automatic speech recognition systems. Because the restoration ability of the LP-based model depends on a few parameters related to the LP coefficients of air-conducted speech, we applied a multi-layer perceptron neural network to predict them blindly, with reasonable results.

Keywords: Bone-conducted (BC) speech, Air-conducted (AC) speech, Linear prediction (LP), Speech intelligibility.

I. INTRODUCTION

The sound quality and intelligibility of speech are influenced by the transmission environment. In a highly noisy environment, speech communication is very difficult for automatic speech recognition (ASR) systems as well as for humans. As a solution, many complicated models and algorithms have been used to cancel or reduce interfering noise. These are efficient only at low and medium noise levels and are ineffective when the noise level is too high.
Another possible solution is to use a special microphone to record the speech signal transmitted through the speaker's head and face. This recorded signal is referred to as "bone-conducted (BC) speech". Its stability against interfering noise from a noisy environment seems to make BC speech more advantageous than noisy air-conducted (AC) speech. Although BC speech, unlike AC speech, is not affected by external noise, it has a drawback: when the signal is transmitted through bone conduction, it is attenuated in a complex manner. As a result, the voice quality of BC speech, meaning both its intelligibility for human hearing systems and the robustness of its features for ASR systems, is very low.

If the voice quality of BC speech can be improved, the restored signal can be used in speech applications in noisy environments with greater efficiency than noisy AC speech. Such applications include human hearing aids and machine hearing systems. Since it is very difficult to restore BC speech blindly, this is a challenging new topic in the speech signal processing field.

The attenuation of the BC speech signal varies with the measurement position (the position of the BC microphone), the speaker, and the pronounced syllable. This is because the characteristics of bone conduction change with the measurement position, and the distribution of frequency components varies with the speaker and the pronounced syllable. In general, the attenuation is strong at high frequencies and resembles lowpass filtering with a cut-off frequency of about 1 kHz [1]. The straightforward method for restoring BC speech is to emphasize the attenuated frequency components, for example with a highpass filter (the inverse of the lowpass filtering described above). However, it is difficult to design a single highpass filter that is adequate independently of the pronounced words, speakers, and measurement positions.
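The lowpass view of BC attenuation and its highpass inverse can be illustrated with a minimal sketch. It rests on several assumptions that are not part of the paper: a first-order lowpass stands in for the bone-conduction path, the sampling rate is 16 kHz, the test signal is two synthetic tones rather than speech, and all function names are hypothetical. Real BC attenuation is more complex and varies with speaker, syllable, and microphone position, which is exactly why one fixed inverse of this kind is inadequate in practice.

```python
import numpy as np

FS = 16000      # sampling rate in Hz (assumed for this sketch)
FC = 1000.0     # lowpass cut-off in Hz, following the ~1 kHz figure in the text


def lowpass_response(freqs, fc=FC):
    """First-order lowpass response, an illustrative stand-in for BC attenuation."""
    return 1.0 / (1.0 + 1j * freqs / fc)


def inverse_response(freqs, fc=FC, max_gain=4.0):
    """Highpass-like inverse of the lowpass above, with its gain capped at max_gain."""
    inv = 1.0 + 1j * freqs / fc
    mag = np.abs(inv)                       # always >= 1, so the division is safe
    return inv * (np.minimum(mag, max_gain) / mag)


def apply_response(x, response_fn, fs=FS):
    """Filter x by multiplying its spectrum with response_fn evaluated per bin."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return np.fft.irfft(spectrum * response_fn(freqs), n=len(x))


def tone_amplitude(x, freq, fs=FS):
    """Amplitude of the sinusoid at freq, which must fall on an exact FFT bin."""
    k = int(round(freq * len(x) / fs))
    return 2.0 * np.abs(np.fft.rfft(x)[k]) / len(x)


# Synthetic "AC" signal: one tone below and one tone above the cut-off.
t = np.arange(FS) / FS
ac = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 3000 * t)

bc = apply_response(ac, lowpass_response)        # simulated BC attenuation
restored = apply_response(bc, inverse_response)  # fixed highpass emphasis

for name, sig in (("AC", ac), ("BC", bc), ("restored", restored)):
    print(name, round(tone_amplitude(sig, 300), 3), round(tone_amplitude(sig, 3000), 3))
```

In this idealized setting the inverse recovers both tones because the attenuation is known exactly. The `max_gain` cap, which limits amplification above roughly 3.9 kHz here, is a design choice of the sketch to avoid boosting noise without bound; it is not taken from the paper.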
There are other methods for deriving an inverse filter, such as the cross-spectrum method [2] and the long-term Fourier transform [3, 4], but these yield restored signals with artifacts such as musical noise and echoes, so the improvement in voice quality is small.

In our previous study, we proposed an MTF-based model that restores a BC speech signal by compensating for the reduced value of the power envelope in the filterbank model [6]. It overcame the drawbacks of the previous methods and yielded a restored signal of better quality. However, since the aim of that model was BC restoration for human hearing, improvement for ASR systems was not considered.

In this paper, we propose a restoration model based on linear prediction (LP) and report our investigation of its ability to improve the voice quality of BC speech. The model originates from the idea that the information corresponding to the source (glottal) characteristics, represented by the LP residue, is the same for both BC and AC speech signals. The adaptive inverse filter is therefore derived primarily from the LP coefficients, which carry the filter information. The model is expected to yield a restored signal that is not only more intelligible to human hearing systems but also enables ASR systems to achieve higher performance.

Information about AC speech is needed to construct the inverse filter, as in all previous models [1, 2, 3, 4, 6]. In this case, the LP-based model also depended on a few

1-4244-0569-6/06/$20.00 ©2006 IEEE