Subjective and objective measurement of synthesized speech intelligibility in modern telephone conditions

Peter Počta a,⇑, John G. Beerends b

a Dept. of Telecommunications and Multimedia, FEE, University of Žilina, SK-01026 Žilina, Slovakia
b TNO, P.O. Box 96800, NL-2509 JE The Hague, The Netherlands

Received 27 January 2015; received in revised form 19 March 2015; accepted 1 April 2015
Available online 9 April 2015

Abstract

This paper investigates the impact of different telephone channels, represented by impairments as introduced by modern telecommunication networks (e.g. speech coding, bandwidth limitation, packet loss, etc.), on the intelligibility of synthesized speech. Both subjective and objective assessments are used. Two different speech intelligibility prediction models, namely PESQ Intelligibility and POLQA Intelligibility, are evaluated by comparing the predictions with subjectively obtained intelligibility scores. The results show that all the investigated degradations seriously impact the intelligibility of the synthesized speech measured subjectively. Furthermore it is shown that PESQ Intelligibility provides too low correlations between predicted objective measurements and subjective scores for accurate prediction of speech intelligibility while POLQA Intelligibility is capable of providing good intelligibility predictions in the case that a closed response experimental set up is used.

© 2015 Elsevier B.V. All rights reserved.

Keywords: Speech intelligibility; Synthesized speech; Telecom degradations; CVC test; POLQA Intelligibility; PESQ Intelligibility

1. Introduction

In recent years, synthesized speech has reached a level of quality which allows it to be integrated into many real-life applications, e.g. e-mail and SMS readers, etc. In particular, Text-to-Speech (TTS) can fruitfully be used in systems enabling interaction with an information database or a transaction server, e.g. via the telephone network.
Modern telephone networks, however, introduce a number of degradations which have to be taken into account when services are planned and developed. The type of degradation depends on the specific network under consideration. In traditional, connection-based (analogue or digital) networks, loss of loudness, frequency distortion and noise are the most significant degradations. In contrast, new types of networks (e.g. mobile or IP-based ones) introduce impairments which are perceptively different from the traditional ones. Examples are non-linear distortions from low bit-rate coding–decoding processes (codecs), overall delay due to signal processing equipment, talker echoes resulting from the delay in conjunction with acoustic or electrical reflections, or time-variant degradations when packets or frames get lost on the digital channel. A combination of all these impairments will be encountered when different networks are interconnected to form a transmission path from the service provider to the user. Thus, the whole path has to be taken into account in order to determine the overall intelligibility of the transmission network.

⇑ Corresponding author.
http://dx.doi.org/10.1016/j.specom.2015.04.001
0167-6393/© 2015 Elsevier B.V. All rights reserved.
Speech Communication 71 (2015) 1–9

To quantify the intelligibility of a speech transmission chain, a number of measurement techniques have been developed during the past decades. In the subjective domain, examples are the consonant–vowel–consonant