CAN STANDARD ANALYSIS TOOLS BE USED ON DECOMPRESSED SPEECH?

R.J.J.H. van Son *
Institute of Phonetic Sciences/ACLC
University of Amsterdam
Herengracht 338, 1016CG Amsterdam
Rob.van.Son@hum.uva.nl

Abstract

This paper quantifies some of the effects of "lossy" audio compression on basic acoustic speech analysis procedures by comparing original audio-CD speech recordings to compressed/decompressed versions of these recordings. The differences found are benchmarked against the effects of a change of microphone. Tested are a Sony Minidisc Walkman recorder and two audio compression codecs: Ogg Vorbis 1.0rc3 and LAME 3.92 (MP3), with 3 bit rates: 40, 80 (Ogg Vorbis), and 192 kbs (MP3). These are evaluated on pitch and formant extraction and on the spectral center of gravity (CoG, i.e., the first spectral moment). Audio compression added only a limited number of "jump errors" ( 3%) to vowel pitch and formant measurements. Only small systematic effects on measurements were found that could be attributed to compression. However, rather large systematic effects resulted from a switch of microphone, mostly on the spectral center of gravity. After removing jump errors, the audio compression algorithms introduced a Root-Mean-Square (RMS) error of less than 1 semitone to vowel mid-point pitch, formant, and CoG measurements. The effect of the microphone change on RMS error was as large (for pitch) or larger (>1.2 semitones for formants and center of gravity). Comparison of the pitch in sonorants and the spectral center of gravity measurements in continuants showed that here too, the RMS errors introduced by the audio compression were always less than 1 semitone, except for the lowest bit-rate, 40 kbs, where CoG errors exploded in vowel-like consonants and fricatives (> 2 semitones). The size of the errors shows an effect of compression factor (bit-rate).
The higher bit-rate encodings always had smaller RMS errors, except for pitch measurements, where there was no effect of encoding or bit-rate whatsoever. When audio compression is applied repeatedly, e.g., during recording, distribution, and archiving, the weakest link determines the total RMS error for pitch and formant measurements. However, the total RMS error of the CoG measurements is the sum of the component errors. It is concluded that Minidisc recordings and compressed speech of bit-rates from 80 kbs and up can be used for acoustic speech analysis if an increased RMS error of up to 1 semitone is acceptable. A low bit-rate encoding of 40 kbs introduces markedly larger errors in formant measurements and must be considered unsuitable for whole-spectrum measurements like the CoG. Repeatedly compressed speech is still useful for pitch and formant measurements, but whole-spectrum (e.g., CoG) measurements should only be used with care.

1. Introduction

High quality "lossy" audio compression has revolutionized the distribution and storage of music. It could do the same for speech corpora. What is slowing down this revolution is uncertainty about the reliability of speech analysis tools when used on decompressed speech. Potential users fear that compression could introduce artifacts or biases in acoustic analysis results. This paper will try to quantify the effects of three popular compression algorithms on basic analysis types: pitch and formant extraction and spectral center of gravity (CoG, first spectral moment) determination.

A range of software has become available that can compress sound recordings efficiently at very high perceptual quality. Cheap, robust, and lightweight equipment based on this software is now available for making high quality recordings. The best known device today is the Sony Minidisc, but similar high-quality devices based on the MP3 or Ogg Vorbis compression standards are becoming available.
As a result, many projects that record speech in natural situations have used, or plan to use, Sony Minidisc equipment. Examples are the Spoken Dutch Corpus (CGN, Oostdijk et al., 2002; cf. http://lands.let.kun.nl/cgn/ehome.htm) and the collection of expressive speech by the Japanese JST/CREST ESP Project (cf. Campbell, 2002a & b). The Sony Minidisc is an almost ideal device for such "field" recordings. It is cheap, small, lightweight, and can run on standard batteries. Moreover, it can be carried around and operated by volunteers in everyday situations without the need for technical assistance.

*Copyright © 2002 R.J.J.H. van Son. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is available from the author (see address above) or from the GNU project http://www.gnu.org/licenses/fdl.html.

The compression algorithm used in the Sony Minidisc, ATRAC3, is one of a class of lossy compression algorithms that remove redundant information from the sound spectrum according to a psycho-acoustic model of human hearing. Two other popular standards in this class of algorithms are MP3 (MPEG-1 Layer 3) and Ogg Vorbis. Perceptually, the decompressed audio of all these algorithms is of very high quality. Naive listeners are often unable to hear the difference between decompressed and original sounds. However, common speech analysis algorithms are not based on a model of human speech recognition, but on a (simplified acoustic) model of speech production. Therefore, it is quite possible that the compression algorithms remove spectro-temporal information that is irrelevant for speech perception, but