SUBJECTIVE RATINGS OF INSTANTANEOUS AND GRADUAL TRANSITIONS FROM NARROWBAND TO WIDEBAND ACTIVE SPEECH

Stephen D. Voran
Institute for Telecommunication Sciences
325 Broadway, Boulder, Colorado, USA
svoran@its.bldrdoc.gov

ABSTRACT

In advanced heterogeneous telecommunication networks, network resources can dynamically dictate the type of speech coding that is used. An increase in resources can be used to lower coding distortion, or it can be used to provide wideband speech instead of narrowband speech. Existing studies have demonstrated that wideband speech is preferred to narrowband speech, but they have also demonstrated that an abrupt transition from narrowband to wideband is perceived as an impairment, even though it is a transition to a higher quality signal. We describe our recent work that resulted in subjective scores for abrupt and gradual transitions from narrowband to wideband at the midpoint of a six-second segment of active speech. On average, signals that start narrowband and end wideband are rated slightly lower than constant narrowband signals, and results are nearly the same for abrupt and gradual (2.5-second) transitions. Scores from 20 listeners show a wide range of individual opinions, so we conclude that studies of bandwidth transitions may be quite sensitive to the listener population sample.

Index Terms— Narrowband speech, speech coding, subjective testing, wideband speech

1. BACKGROUND AND MOTIVATION

Telecommunication networks are simultaneously becoming less homogeneous and more adaptive. This makes it more likely that the network resources available to support a given call can change during the call, especially if one or more call participants are mobile. If network resources increase and additional data capacity is available, then it is possible to switch to higher bit rate speech coding. One might keep the encoded speech bandwidth fixed and use the extra bits to reduce coding distortion.
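The abstract contrasts abrupt transitions with gradual 2.5-second transitions placed at the midpoint of a six-second segment. This section does not describe how the gradual transition was produced, so the Python sketch below simply assumes that both passband edges move linearly in Hz from the canonical NB values (300 and 3400 Hz) to the canonical WB values (50 and 7000 Hz) over a ramp centered on the midpoint. The function `passband_at`, the linear ramp, and the centering are hypothetical choices made for illustration, not the method used in this work.

```python
# Hypothetical sketch: passband edges (Hz) versus time for an NB-to-WB
# transition in a 6 s segment. The linear-in-Hz ramp and the choice to
# center the ramp on the midpoint are assumptions, not the paper's method.

NB_BAND = (300.0, 3400.0)   # canonical NB passband (Hz)
WB_BAND = (50.0, 7000.0)    # canonical WB passband (Hz)


def passband_at(t, duration=6.0, ramp=2.5):
    """Return (low_edge_hz, high_edge_hz) at time t (seconds).

    ramp <= 0 gives an abrupt switch at the midpoint; ramp > 0 moves both
    edges linearly from NB to WB over a window centered on the midpoint.
    """
    mid = duration / 2.0
    if ramp <= 0.0:
        # Abrupt transition: NB before the midpoint, WB from the midpoint on.
        return NB_BAND if t < mid else WB_BAND
    start, end = mid - ramp / 2.0, mid + ramp / 2.0
    if t <= start:
        return NB_BAND
    if t >= end:
        return WB_BAND
    frac = (t - start) / ramp  # 0 at the start of the ramp, 1 at its end
    low = NB_BAND[0] + frac * (WB_BAND[0] - NB_BAND[0])
    high = NB_BAND[1] + frac * (WB_BAND[1] - NB_BAND[1])
    return (low, high)


if __name__ == "__main__":
    for t in (0.0, 1.75, 3.0, 4.25, 6.0):
        print(t, passband_at(t), passband_at(t, ramp=0.0))
```

Setting `ramp=0.0` reproduces an abrupt switch. Applying these edges to speech would additionally require a time-varying bandpass filter, which is not shown, and a ramp on a logarithmic frequency axis would be an equally plausible assumption.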
Another possibility is to switch from narrowband (NB) to wideband (WB) speech coding. Unless otherwise indicated, we use what we consider to be the canonical definitions of speech passbands, based on the original NB and WB digital speech coders. Thus NB indicates a speech passband of 300 to 3400 Hz, consistent with the minimum -3 dB bandwidth for G.711 PCM specified in [1], and WB refers to a speech passband of 50 to 7000 Hz, as given in [2].

Several studies, including [3], have shown that WB is preferred to NB in controlled subjective experiments. Note that a transition from NB to WB entails both a low frequency extension (LFE) and a high frequency extension (HFE). The study in [3] shows that the HFE alone does not enhance perceived speech quality, but the LFE alone does. If the LFE is in place, then the HFE further enhances speech quality. Consistent with this finding, more recent NB/WB speech coders include most or all of the LFE in the NB mode, so switching to WB entails mainly the HFE. The specifications for the G.729.1 speech coding algorithm indicate a nominal NB bandwidth of 50 to 4000 Hz and a nominal WB bandwidth of 50 to 7000 Hz [4]. Measurements of the AMR speech coders show an NB -3 dB bandwidth from 85 Hz up to between 2800 and 3600 Hz (the upper limit depends on the AMR mode) [5], and a WB bandwidth from 50 Hz up to between 5700 and 6600 Hz (the upper limit again depends on the mode) [6].

Given that listeners prefer WB to NB, it might seem logical to have a telecommunication system switch from NB to WB speech coding as soon as network resources become available. Earlier work shows that if listeners hear quality $Q_{\mathrm{low}}$ for a total of $(1-\alpha)\,T$ seconds and quality $Q_{\mathrm{high}}$ for a total of $\alpha\,T$ seconds, then as $\alpha$ goes from 0 to 1 the overall rating of the experience increases monotonically from $Q_{\mathrm{low}}$ to $Q_{\mathrm{high}}$ [7].
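As a rough illustration of the result from [7], the Python sketch below assumes the simplest monotone relationship: the overall rating is the duration-weighted mean of the quality levels heard, so hearing $Q_{\mathrm{low}}$ for $(1-\alpha)T$ seconds and $Q_{\mathrm{high}}$ for $\alpha T$ seconds yields $(1-\alpha)Q_{\mathrm{low}} + \alpha Q_{\mathrm{high}}$. [7] reports monotonicity, not linearity, so this weighting and the function names `time_weighted_rating` and `rating_vs_alpha` are assumptions made here; the quality values in the usage lines are hypothetical.

```python
# Assumed model: the overall rating is the duration-weighted mean of the
# quality levels heard. [7] only reports that the rating rises
# monotonically with alpha; the linear form here is an illustration.

def time_weighted_rating(segments):
    """segments: iterable of (duration_s, quality) pairs, in any order."""
    total = 0.0
    weighted = 0.0
    for duration, quality in segments:
        if duration < 0:
            raise ValueError("durations must be non-negative")
        total += duration
        weighted += duration * quality
    if total == 0:
        raise ValueError("total duration must be positive")
    return weighted / total


def rating_vs_alpha(alpha, q_low, q_high, T=1.0):
    """Rating when q_low is heard for (1-alpha)*T s and q_high for alpha*T s."""
    return time_weighted_rating([((1.0 - alpha) * T, q_low), (alpha * T, q_high)])


if __name__ == "__main__":
    # Hypothetical quality levels on a 5-point scale.
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(alpha, rating_vs_alpha(alpha, q_low=3.0, q_high=4.0))
```

Because segments are accumulated by duration, the phrase "for a total of" is respected: splitting the high-quality time into several pieces leaves the result unchanged, and $T$ cancels out of this model.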
From this result we might expect a signal with both NB and WB portions to receive a quality rating between the constant NB rating and the constant WB rating, and thus there would be at least some improvement associated with switching to WB whenever possible. We note, however, that the results in [7] were based on 3-second signals, and at this short time scale listeners were not conscious of the transitions between quality levels. In addition, that work did not use any bandwidth transitions.

More recently, a team of researchers developing and evaluating handoff strategies for telecommunications over wireless networks has included NB/WB switching in a set of important experiments [8] [9]. This is a rich body of work, and it has revealed much about the speech quality associated with handoffs, packet loss, bandwidth switching, and the relationships among those factors. Here we focus only on the bandwidth switching aspect of this work.

Included in [8] and [9] are subjective tests with signals that switch between an NB speech coder (G.711) and a WB speech coder (AMR-WB, also named G.722.2) in conjunction with a network handoff. Results indicate that the handoffs themselves do not hurt perceived speech quality, but the bandwidth switching can. Specifically, switching from NB to WB coding at the midpoint of a six-second recording results in a lower score (mean opinion score, or MOS, near 3.4) than a constant NB version of the recording (MOS near 3.9). WB speech coding was rated at a MOS near 5.0. In 60-second tests, results show that switching from NB to WB near the 15 or 30 second point results in a small score increase relative to constant NB, but switching near the 45 second mark results in a small decrease. The conclusions are that the switch from NB coding to WB coding is perceived as an impairment, even though it is a transition to a higher quality speech signal.
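To put the six-second results of [8] [9] in perspective, suppose, purely for illustration and not as a claim made in [7], [8], or [9], that a midpoint switch ($\alpha = 0.5$) were rated as a linear time-weighted average of the constant-NB and constant-WB scores. Using the approximate MOS values quoted above:

$$\hat{R}(0.5) = (1 - 0.5)\,(3.9) + 0.5\,(5.0) = 1.95 + 2.5 = 4.45$$

$$\hat{R}(0.5) - \mathrm{MOS}_{\mathrm{switch}} \approx 4.45 - 3.4 = 1.05$$

Under this assumption the observed score falls roughly one MOS point short of the prediction, and about half a point below constant NB ($3.9 - 3.4 = 0.5$), which quantifies the sense in which the switch itself acts as an impairment.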
If this impairment happens early enough in the signal, it can be outweighed by the higher quality of WB in the remainder of the signal.

4674 U.S. Government Work Not Protected by U.S. Copyright ICASSP 2010

The notion that bandwidth switching is at least a minor impair-