CAN STANDARD ANALYSIS TOOLS BE USED ON
DECOMPRESSED SPEECH?
R.J.J.H. van Son*
Institute of Phonetic Sciences/ACLC
University of Amsterdam
Herengracht 338, 1016CG Amsterdam
Rob.van.Son@hum.uva.nl
Abstract
This paper quantifies some of the effects of "lossy" audio compression on basic acoustic speech analysis procedures by
comparing original audio-CD speech recordings to compressed/decompressed versions of these recordings. The
differences found are benchmarked against the effects of a change of microphone. Tested are a Sony Minidisc Walkman
recorder and two audio compression codecs: Ogg Vorbis 1.0rc3 and LAME 3.92 (MP3), with 3 bit rates: 40, 80 (Ogg
Vorbis), and 192 kbs (MP3). These are tested against pitch and formant extraction and spectral center of gravity (i.e.,
first spectral moment). Audio compression added only a limited amount of "jump errors" ( 3%) to vowel pitch and
formant measurements. Only small systematic effects on measurements were found that could be attributed to
compression. However, rather large systematic effects resulted from a switch of microphone, mostly on the spectral
center of gravity. The audio compression algorithms introduced a Root-Mean-Square (RMS) error, after removing jump
errors, of less than 1 semitone to vowel mid-point pitch, formant, and CoG measurements. The effect of the microphone
change on RMS error was as large (for pitch) or larger (>1.2 semitones for formants and center of gravity).
Comparison of the pitch in sonorants and the spectral center of gravity measurements in continuants showed that here
too, the RMS errors introduced by the audio compression were always less than 1 semitone, except for the lowest bit-rate,
40 kbs, where CoG errors exploded in vowel-like consonants and fricatives (> 2 semitones). The size of the errors shows
an effect of compression factor (bit-rate). The higher bit-rate encodings always had smaller RMS errors, except for pitch
measurements where there was no effect of encoding or bit-rate whatsoever. When audio compression is applied
repeatedly, e.g., during recording, distribution, and archiving, the weakest link determines the total RMS error for pitch
and formant measurements. However, the total RMS error of the CoG measurements is the sum of the component errors.
It is concluded that Minidisc recordings and compressed speech of bit-rates from 80 kbs and up can be used for acoustic
speech analysis if an increased RMS error of up to 1 semitone is acceptable. A low bit-rate encoding of 40 kbs introduces
markedly larger errors in formant measurements and must be considered unsuitable for whole-spectrum measurements
like the CoG. Repeatedly compressed speech is still useful for pitch and formant measurements, but whole-spectrum (e.g.,
CoG) measurements should only be used with care.
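The error measure used above, an RMS difference expressed in semitones after removal of jump errors, can be illustrated with a minimal sketch. It assumes that paired measurements in Hz are available for the original and the decompressed recording. The function names and the `jump_threshold` parameter are hypothetical; the criterion this paper uses to identify jump errors is not specified in this section, so a simple absolute threshold stands in for it here.

```python
import math


def semitone_difference(f_ref, f_test):
    """Signed difference between two frequencies (Hz) in semitones."""
    return 12.0 * math.log2(f_test / f_ref)


def rms_semitone_error(ref_hz, test_hz, jump_threshold=None):
    """RMS semitone difference over paired measurements.

    jump_threshold is an illustrative assumption: pairs whose absolute
    difference exceeds it (in semitones) are counted as jump errors and
    left out of the RMS. Returns (rms, number_of_jump_errors).
    """
    diffs = [semitone_difference(r, t) for r, t in zip(ref_hz, test_hz)]
    if jump_threshold is None:
        kept = diffs
    else:
        kept = [d for d in diffs if abs(d) <= jump_threshold]
    n_jumps = len(diffs) - len(kept)
    if not kept:
        return float("nan"), n_jumps
    rms = math.sqrt(sum(d * d for d in kept) / len(kept))
    return rms, n_jumps


# Example: an octave error (100 Hz measured as 200 Hz) is 12 semitones
# and would be treated as a jump error with a threshold of 6 semitones.
rms, jumps = rms_semitone_error([100.0, 100.0], [200.0, 100.0], jump_threshold=6.0)
```

Expressing differences on a logarithmic scale makes errors in pitch, formant, and CoG values comparable regardless of the absolute frequency involved.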
1. Introduction
High quality "lossy" audio compression has revolutionized the distribution and storage of music. It could do the same
for speech corpora. What is slowing down this revolution is uncertainty about the reliability of speech analysis tools
when used on decompressed speech. Potential users fear that compression could introduce artifacts or biases in acoustic
analysis results. This paper will try to quantify the effects of three popular compression algorithms with respect to basic
analysis types: pitch and formant extraction and spectral center of gravity (CoG, first spectral moment) determination.
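As an illustration of the last of these measures, the spectral center of gravity is the mean frequency of the spectrum, weighted by the energy at each frequency. The sketch below is not the analysis procedure used in this paper. It assumes weighting by the power spectrum and uses a plain discrete Fourier transform without windowing; the function names are hypothetical.

```python
import cmath
import math


def power_spectrum(samples, sample_rate):
    """Naive DFT power spectrum for non-negative frequencies.

    Returns (freqs_hz, power). Quadratic in the number of samples, so
    suitable only for short illustrative signals.
    """
    n = len(samples)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        freqs.append(k * sample_rate / n)
        power.append(abs(acc) ** 2)
    return freqs, power


def center_of_gravity(freqs, power):
    """First spectral moment: power-weighted mean frequency in Hz."""
    total = sum(power)
    if total == 0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs, power)) / total


# Example: two equal-amplitude tones at 1000 Hz and 3000 Hz give a CoG
# midway between them, at about 2000 Hz.
sr, n = 8000, 64
signal = [math.sin(2 * math.pi * 1000 * i / sr) +
          math.sin(2 * math.pi * 3000 * i / sr) for i in range(n)]
cog = center_of_gravity(*power_spectrum(signal, sr))
```

Because every frequency bin contributes to the result, a whole-spectrum measure of this kind is sensitive to any spectral components that a codec removes or alters.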
A range of software has become available that can compress sound recordings efficiently at very high perceptual
quality. Cheap, robust, and light-weight equipment based on this software is now available for making high quality
recordings. The best known device today is the Sony Minidisc, but similar high-quality devices based on the MP3 or Ogg
Vorbis compression standards are becoming available. As a result, many projects that record speech in natural situations
have used, or plan to use, Sony Minidisc equipment to record speech. Examples are the Spoken Dutch Corpus (CGN,
Oostdijk et al., 2002; cf. http://lands.let.kun.nl/cgn/ehome.htm) and the collection of expressive speech by the Japanese
JST/CREST ESP Project (cf. Campbell, 2002a & b). The Sony Minidisc is an almost ideal device for such "field"
recordings. It is cheap, small, lightweight, and can run on standard batteries. Moreover, it can be carried around and
operated by volunteers in everyday situations without the need for technical assistance.
The compression algorithm used in the Sony Minidisc, ATRAC3, is one of a class of lossy compression algorithms
that remove redundant information from the sound spectrum according to a psycho-acoustic model of human hearing.
Two other popular standards in this class of algorithms are MP3 (MPEG-1 Layer 3) and Ogg Vorbis. Perceptually, the
decompressed audio of all these algorithms is of very high quality. Naive listeners are often unable to hear the difference
between decompressed and original sounds. However, common speech analysis algorithms are not based on a model of
human speech recognition, but on a (simplified acoustic) model of speech production. Therefore, it is quite possible
that the compression algorithms remove spectro-temporal information that is irrelevant for speech perception, but
*Copyright © 2002 R.J.J.H. van Son.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any
later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts.
A copy of the license is available from the author (see address above) or from the GNU project http://www.gnu.org/licenses/fdl.html.