UNIFIED SPEECH AND AUDIO CODING SCHEME FOR HIGH QUALITY AT LOW BITRATES

M. Neuendorf 1, P. Gournay 2, M. Multrus 1, J. Lecomte 1, B. Bessette 2, R. Geiger 1, S. Bayer 1, G. Fuchs 1, J. Hilpert 1, N. Rettelbach 1, R. Salami 3, G. Schuller 4, R. Lefebvre 2, B. Grill 1

1 Fraunhofer IIS, Erlangen, Germany, 2 University of Sherbrooke, Sherbrooke, Canada, 3 VoiceAge Corp., Montréal, Canada, 4 Fraunhofer IDMT, Ilmenau, Germany

ABSTRACT

Traditionally, speech coding and audio coding were separate worlds. Based on different technical approaches and different assumptions about the source signal, neither of the two coding schemes could efficiently represent both speech and music at low bitrates. This paper presents a unified speech and audio codec, which efficiently combines techniques from both worlds. This results in a codec that exhibits consistently high quality for speech, music and mixed audio content. The paper gives an overview of the codec architecture and presents results of formal listening tests comparing this new codec with HE-AAC(v2) and AMR-WB+. This new codec forms the basis of the reference model in the ongoing MPEG standardization activity for Unified Speech and Audio Coding.

Index Terms: Audio coding, speech coding

1. INTRODUCTION

With the increasing number of portable and wireless devices, there is a growing demand for low bitrate audio codecs. In several applications, for example broadcasting, audiobooks and audio/video playback, the content can be varied and is not limited to speech or music only. Hence, a unified audio codec that can deal equally well with all types of audio content is highly desired.

Audio coding schemes, such as MPEG-4 High Efficiency AAC (HE-AAC) [1, 2], are advantageous in that they show a high perceived quality at low bitrates for music signals. However, the subband and transform-based models used in such audio coding schemes do not perform well on speech signals, i.e.
they cannot use a small bit budget as efficiently as linear predictive (LP) coders when encoding speech.

LP coding (or LPC), and in particular CELP coding, is well suited for representing speech at low bitrates. The excitation-filter paradigm in LP coders closely follows the speech production process. State-of-the-art speech coders include the 3GPP AMR-WB standard [3, 4], which can produce high quality wideband speech at less than 1 bit per sample. In general, speech coding schemes show a high quality for speech even at low bitrates, but a poor quality for music.

Attempts to unify speech and audio coding were made by the 3GPP AMR-WB+ standard [5, 6]: the AMR-WB speech coder was extended by selectable frequency domain coding and a stereo mode. In this way, the capability of coding music was significantly improved. Still, for music signals the AMR-WB+ audio coding model does not match HE-AAC(v2).

In [7] a unified speech and audio codec was built by combining AMR-WB and HE-AAC. However, for speech signals the performance of AMR-WB was not preserved.

In this paper, a new coding model is presented which retains all the advantages of state-of-the-art speech and audio codecs. Techniques from both HE-AAC and AMR-WB+ are combined in order to allow seamless switching between a more general music coding mode and a speech-specific coding mode. Formal listening tests show that the resulting codec is, for each signal category, at least as good as the better of HE-AAC(v2) and AMR-WB+; thus the goal of a unified speech and audio codec, as stated in [8], is reached.

2. STATE OF THE ART

2.1. HE-AAC(v2) and MPEG Surround

Frequency domain coding schemes such as AAC [1] are based on three main steps: (1) a time/frequency conversion; (2) a subsequent quantization stage, in which the quantization error is controlled using information from a psychoacoustic model; and (3) an encoding stage, in which the quantized spectral coefficients and corresponding side information are entropy-encoded using code tables. This results in a source-controlled, variable-rate codec which adapts to the input signal statistics as well as to the characteristics of human perception.

To further reduce the bitrate, HE-AAC combines an AAC core in the low frequency band with a parametric coding approach (SBR) for the high frequency band [2]. The high frequency band is reconstructed from replicated low frequency signal portions, controlled by parameter sets containing level, noise and tonality parameters.

Although HE-AAC has generic multi-channel capabilities, it can also be combined with a joint stereo or multi-channel coding tool to further reduce the bitrate. The combination of “Parametric Stereo” [1, 9] and HE-AAC is known as HE-AACv2 and is capable of representing stereo signals by a mono downmix and corresponding sets of inter-channel level, phase and correlation parameters. Using “MPEG Surround” [9, 10], this principle is extended to transmit N audio input channels via M transmission channels (where M < N) and corresponding parameter sets.

2.2. AMR-WB and AMR-WB+

Efficient speech coding schemes, such as AMR-WB, typically have three major components: (1) a short-term linear prediction (LP) filter, which models the vocal tract; (2) a long-term prediction (LTP) filter, which models the periodicity in the excitation signal from the vocal cords; and (3) an innovation codebook, which essentially encodes the non-predictive part of the speech signal. In AMR-WB, the innovation codebook uses the ACELP model.
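The three-step frequency domain pipeline of Section 2.1 can be illustrated with a toy transform coder. This is only a sketch, not the AAC algorithm: it substitutes a plain DCT-II for AAC's windowed MDCT, a single uniform quantizer step size for the psychoacoustically controlled scale factors, and a zeroth-order entropy estimate for the Huffman code tables; all function names here are illustrative.

```python
import math

def dct_ii(x):
    """Naive DCT-II: the time/frequency conversion step (stand-in for the MDCT)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

def quantize(coeffs, step):
    """Uniform quantization; in AAC the per-band step size is driven by the psychoacoustic model."""
    return [round(c / step) for c in coeffs]

def entropy_bits(levels):
    """Zeroth-order entropy of the quantized levels, a stand-in for the entropy coding stage."""
    counts = {}
    for q in levels:
        counts[q] = counts.get(q, 0) + 1
    total = len(levels)
    return -sum(c / total * math.log2(c / total) for c in counts.values()) * total

# Encode one block of a test tone: transform, quantize, estimate the bit demand.
block = [math.sin(2 * math.pi * 3 * n / 32) for n in range(32)]
spectrum = dct_ii(block)
levels = quantize(spectrum, step=0.5)
print(f"estimated bits for block: {entropy_bits(levels):.1f}")
```

Because a tonal block concentrates its energy in few coefficients, most quantized levels are zero and the entropy estimate is small, which is the source-controlled, variable-rate behavior described above.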
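The short-term and long-term prediction components can likewise be sketched. The following toy analysis (not the AMR-WB implementation; the helper names and the synthetic test frame are illustrative) derives short-term LP coefficients from the frame autocorrelation via the Levinson-Durbin recursion, computes the prediction residual, and then picks an LTP lag by maximizing the normalized correlation of that residual:

```python
import math

def lp_coefficients(frame, order):
    """Levinson-Durbin recursion on the frame autocorrelation (short-term LP analysis)."""
    r = [sum(frame[n] * frame[n - i] for n in range(i, len(frame)))
         for i in range(order + 1)]
    a, err = [], r[0]
    for i in range(order):
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        a = [a[j] - k * a[i - 1 - j] for j in range(i)] + [k]
        err *= (1 - k * k)
    return a  # predictor: x_hat[n] = sum(a[j] * x[n - 1 - j])

def residual(frame, a):
    """Short-term prediction residual, i.e. the excitation left for LTP and the codebook."""
    p = len(a)
    return [frame[n] - sum(a[j] * frame[n - 1 - j] for j in range(min(p, n)))
            for n in range(len(frame))]

def best_ltp_lag(res, lo, hi):
    """Pick the long-term prediction lag with the highest normalized correlation."""
    def score(lag):
        num = sum(res[n] * res[n - lag] for n in range(lag, len(res)))
        den = sum(res[n - lag] ** 2 for n in range(lag, len(res))) or 1e-12
        return num * num / den
    return max(range(lo, hi + 1), key=score)

# Crude 'voiced' frame: a pulse train (period 40) through a one-pole 'vocal tract'.
frame, prev = [], 0.0
for n in range(160):
    prev = (1.0 if n % 40 == 0 else 0.0) + 0.7 * prev
    frame.append(prev)

a = lp_coefficients(frame, order=8)
res = residual(frame, a)
print("estimated pitch lag:", best_ltp_lag(res, 20, 80))  # recovers the 40-sample period
```

The LP filter whitens the spectral envelope (here the one-pole decay), the residual retains the pulse periodicity that the LTP lag search recovers, and whatever remains after both predictions is what the innovation codebook must encode.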
In ACELP, a short block of excitation signal is encoded as a sparse set of pulses and associated gain for the block. The gain, signs and positions of the pulses are found in a closed-loop search (analysis-by-synthesis). The