A Study on an LP-based Model for Restoring
Bone-conducted Speech
Thang Tat Vu, Masashi Unoki, and Masato Akagi
School of Information Science,
Japan Advanced Institute of Science and Technology
1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan
Email: {vu-thang, unoki, akagi}@jaist.ac.jp
Abstract. In a highly noisy environment, bone-conducted speech
seems to be more advantageous than normal noisy speech
because of its stability against surrounding noise. The sound
quality of bone-conducted speech, however, is very low and
restoring bone-conducted speech is a challenging new topic in
the speech signal processing field. In this paper, we propose a
restoration model based on linear prediction (LP). To evaluate
the ability of the LP-based model to improve the voice quality, we
compared it with existing models using one subjective and three
objective measurements. The experiments showed that the LP-
based model yields restored signals that are better for both
human hearing and for the front-ends of automatic speech
recognition systems. As the restoration ability of the LP-based
model depended on a few parameters related to the LP
coefficients of air-conducted speech, we applied a multi-layer
perceptron neural network to blindly predict them with
reasonable results.
Keywords: Bone-conducted (BC) speech, Air-conducted (AC)
speech, Linear prediction (LP), Speech intelligibility.
I. INTRODUCTION
The sound quality and intelligibility of speech are
influenced by the transmission environment. In a highly noisy
environment, it is very difficult for automatic speech
recognition (ASR) systems as well as for humans to
accomplish speech communication. Many models and algorithms
have been proposed for canceling or reducing interfering
noise, but they are effective only at low and medium noise
levels and become ineffective when the noise level is too high.
Another possible solution is to use a special microphone to
record the speech signal transmitted through the speaker's head
and face. This recorded signal is referred to as "bone-
conducted (BC) speech". Its stability against interfering noise
from a noisy environment seems to make BC speech more
advantageous than noisy air-conducted (AC) speech. Although
BC speech is not affected by external noise while AC speech
is, there is a drawback to using BC speech: when the signal is
transmitted through bone conduction, it is attenuated in a
complex manner. As a result, the voice quality of BC speech,
by which we mean both its intelligibility for the human hearing
system and the robustness of its features for ASR systems, is
very low. If the
voice quality of BC speech can be improved, the restored
signal can be used in place of noisy AC speech in speech
applications that operate in noisy environments, with greater
efficiency. Such applications include human hearing aids and
machine hearing systems. Since it is very difficult to blindly
restore BC speech, this is a challenging new topic in the speech
signal processing field.
The attenuation of the BC speech signal varies for different
measurement positions (positions of the BC microphone),
speakers, and pronounced syllables. This is because the
characteristics of bone-conduction change for different
measurement positions, and the distribution of frequency
components varies with speakers and different pronounced
syllables. In general, this attenuation is strong at high
frequencies and resembles lowpass filtering with a cut-off
frequency of about 1 kHz [1]. The straightforward way to
restore BC speech is to emphasize the attenuated frequency
components, for example with a highpass filter (the inverse
of the lowpass filter described above). However, it is
difficult to design a single highpass filter that
is independent of the pronounced words, speakers, and
measurement positions. Other methods for
deriving an inverse filter, such as the cross-spectrum method
[2] and the long-term Fourier transform [3, 4], yield
restored signals with artifacts such as musical noise and
echoes, so the improvement in voice quality is small.
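The difficulty with a single fixed inverse filter can be illustrated with a toy example. The sketch below is not the measured bone-conduction characteristic: it assumes, purely for illustration, that the attenuation is a one-pole lowpass filter with a 1 kHz cut-off, and all function names and signal parameters are hypothetical. It shows that the inverse filter recovers the input when its cut-off matches the attenuation, and leaves a clear error when it does not.

```python
import math

def one_pole_lowpass(x, cutoff_hz, fs_hz):
    """Toy attenuation model (assumption): y[n] = (1 - a) x[n] + a y[n-1]."""
    alpha = math.exp(-2.0 * math.pi * cutoff_hz / fs_hz)
    y, prev = [], 0.0
    for s in x:
        prev = (1.0 - alpha) * s + alpha * prev
        y.append(prev)
    return y

def one_pole_inverse(y, cutoff_hz, fs_hz):
    """Exact inverse of one_pole_lowpass for the same cut-off (a highpass emphasis)."""
    alpha = math.exp(-2.0 * math.pi * cutoff_hz / fs_hz)
    x, prev = [], 0.0
    for s in y:
        x.append((s - alpha * prev) / (1.0 - alpha))
        prev = s
    return x

def max_abs_error(a, b):
    """Largest sample-wise absolute difference between two signals."""
    return max(abs(p - q) for p, q in zip(a, b))

fs = 16000.0  # hypothetical sampling rate
# Hypothetical "AC" signal: a 300 Hz component plus a weaker 3 kHz component.
ac = [math.sin(2 * math.pi * 300 * n / fs) + 0.5 * math.sin(2 * math.pi * 3000 * n / fs)
      for n in range(800)]
bc = one_pole_lowpass(ac, 1000.0, fs)         # simulated "BC" signal
matched = one_pole_inverse(bc, 1000.0, fs)    # inverse designed for the true cut-off
mismatched = one_pole_inverse(bc, 2000.0, fs) # inverse designed for a wrong cut-off

print("error with matched inverse:   ", max_abs_error(ac, matched))
print("error with mismatched inverse:", max_abs_error(ac, mismatched))
```

In practice the attenuation changes with the speaker, the syllable, and the microphone position, so a fixed design corresponds to the mismatched case more often than the matched one.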
In our previous study, we proposed a model based on the
modulation transfer function (MTF) that restores a BC speech
signal by compensating for the reduced value of the power
envelope in the filterbank model [6]. It overcame the
drawbacks of the previous methods and yielded a restored
signal of better quality. However, because that model was
aimed at restoring BC speech for human hearing, improvement
for ASR systems was not considered.
In this paper, we propose a restoration model based on
linear prediction (LP) and report our investigation of its ability
to improve the voice quality of BC speech. This model
originates from the idea that the source (glottal)
characteristics, represented by the LP residue, are the same
for both BC and AC speech signals. The adaptive
inverse filter can therefore be derived primarily from the LP
coefficients, which carry the filter information. This model
is expected to yield a restored signal that is not only more
intelligible to human hearing systems but also enables ASR
systems to achieve higher performance.
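The source-filter idea behind this proposal can be sketched in a few lines. The code below is a minimal illustration and not the model evaluated in this paper: it assumes the LP coefficients of the corresponding AC speech are already known, uses a synthetic impulse-train excitation, and all function names and filter values are hypothetical. The BC signal is inverse-filtered with its own LP coefficients to obtain the residue, which is then passed through the all-pole filter defined by the AC coefficients.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """LP coefficients a (a[0] = 1) for A(z) = 1 + sum_k a[k] z^-k, plus error power."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lp_residual(x, a):
    """Inverse (analysis) filtering: e[n] = x[n] + sum_k a[k] x[n-k]."""
    return [x[n] + sum(a[k] * x[n - k] for k in range(1, len(a)) if n - k >= 0)
            for n in range(len(x))]

def lp_synthesize(e, a):
    """All-pole (synthesis) filtering: y[n] = e[n] - sum_k a[k] y[n-k]."""
    y = []
    for n in range(len(e)):
        y.append(e[n] - sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0))
    return y

def restore(bc, a_bc, a_ac):
    """Keep the BC residue, replace the BC filter with the AC filter."""
    return lp_synthesize(lp_residual(bc, a_bc), a_ac)

# Synthetic example: one shared excitation, two hypothetical first-order filters.
excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]
a_ac = [1.0, -0.5]    # assumed AC filter
a_bc = [1.0, -0.95]   # assumed BC filter (stronger lowpass character)
ac = lp_synthesize(excitation, a_ac)
bc = lp_synthesize(excitation, a_bc)
restored = restore(bc, a_bc, a_ac)

# In practice a_bc would be estimated from the BC signal itself (approximate).
a_bc_est, _ = levinson_durbin(autocorrelation(bc, 1), 1)
print("estimated BC coefficients:", a_bc_est)
print("max |restored - ac|:", max(abs(p - q) for p, q in zip(restored, ac)))
```

In this idealized setting the restored signal matches the AC signal because both share exactly the same excitation. With real speech the residues differ slightly and the AC coefficients are unknown.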
Information about AC speech is needed to construct the
inverse filter, as in all previous models [1, 2, 3, 4, 6]. In
this case, the LP-based model also depends on a few
1-4244-0569-6/06/$20.00 ©2006 IEEE