Vowel intelligibility in Chinese

Ian McLoughlin

Abstract

Conventional wisdom states that, since the average amplitude of vowel articulation significantly exceeds that of consonants, assessments of spoken intelligibility in obscuring noise should primarily be limited by consonant confusion. Furthermore, in both English and Chinese, consonant discrimination is considered to be more important to overall intelligibility than vowel discrimination. In the unbounded case, the assumption that vowel confusion is less important than consonant confusion may well be true; however, at least two situations exist where the influence of vowel confusion may be greater. The first is where vocabulary-specific restrictions confine the structure of a particular spoken word to alternatives differing primarily in their vowel. The second is the prevalence of non-AWGN interference, particularly impulsive noise which obscures only the vowel portion of a word, and which is similarly present as a nonlinear effect of many time-sliced processing algorithms. This paper explores the issue of vowel intelligibility for spoken Chinese, where the confusion characteristics are complicated by the influence of lexical tone carried by the vowel in CVC structure utterances. Experimental evidence from multi-listener intelligibility testing is presented to build toward an understanding of the characteristics of Mandarin Chinese vowel confusion in the presence of AWGN. Results are also isolated by carrier word consonants and in terms of the lexical tone overlaid upon the tested vowels. In particular, several factors relating to vowel length, tone combination and the crucial influence of the /a/ (IPA [A]) phone are revealed.

Index Terms: Mandarin, Chinese, intelligibility, tone, vowel, consonant

I. INTRODUCTION

Speech-based technologies have become increasingly important over recent years, not least through the near-ubiquitous availability of wireless voice technology such as mobile phones.
Significant social and economic wellbeing now depends upon quality speech communication over these networks, and as such any factors which reduce their quality are best minimised or eliminated altogether.

Coincident with the expanding role of wireless voice technology has been the economic rise of China as an emerging world superpower. Rates of cellular telephony ownership in China are high, and growing. In all likelihood, there will soon come a time, if it has not already passed, when the majority of worldwide speech processing operates on Mandarin speech¹. In commercial terms, Mandarin speech communication is likely to constitute the world's largest and most important telecommunications market. For these reasons, special attention has been paid to the intelligibility of Chinese speech, in particular to those aspects of speech affected by such communications systems.

Speech assessment methods can be either subjective or objective. The former require a group of human listeners, while the latter are typically conducted by automated systems. Tests can be made to evaluate either quality (how nice the speech sounds) or intelligibility (the ability to understand it). It is intelligibility evaluation which is the focus of the present paper: whilst perceived quality tends to sell systems, it is intelligibility which relates more closely to the ability to successfully conduct vocal communications.

Subjective intelligibility testing in Chinese has been performed using the proposed Chinese Diagnostic Rhyme Test (CDRT) standard for the past decade; this was enhanced with additions to assess tone discrimination, and subsequently evolved into a combined New CDRT (NCDRT) test methodology [1]. These tests, each based around part of ANSI standard S3.2 [2], have been applied to evaluate several speech coders, such as GSM 06.10 [3] and ITU G.728 [4], for the conveyance of Mandarin speech.
¹ The term ‘Mandarin’ is used loosely to refer to the majority Chinese dialect, and is used interchangeably with the term ‘Chinese’ in this paper.

The diagnostic rhyme test (DRT) A/B forced comparison method is one of the more popular intelligibility evaluation procedures enshrined in ANSI S3.2. It presents word pairs differing by a single attribute to listeners (see section III) [2], who are informed of the two possible choices and asked to select the correct one. Attributes which are misidentified more often are classed as being more confused (or confusing) than others. Typically, two sets of word pairs are presented: one set has passed through a device under test (DUT), and one has not. The difference between the confusion rates for attributes in each set is used to pinpoint the effect of the DUT on particular attributes. In the DRT, the differing attributes are simply the initial consonants from 96 rhyming word pairs. First published by Voiers [5] in 1983, and used by the author for many years, the DRT demonstrates good repeatability and accuracy. In particular, this measure of intelligibility is considered a good predictor of overall speech intelligibility for a given system.

The NCDRT parallels the DRT methodology, but uses Chinese words with modified, language-specific selection criteria. However, the tonal nature of Chinese (see section II), which causes the understanding of Chinese words to be strongly dependent upon correct recognition of lexical tone, implies that measurement of consonant intelligibility alone is insufficient to predict