REVISITING GRAPHEMES WITH INCREASING AMOUNTS OF DATA

Yun-Hsuan Sung†*, Thad Hughes*, Françoise Beaufays*, Brian Strope*

* Google Inc., Mountain View, CA
† Dept. of EE, Stanford University, Stanford, CA

ABSTRACT

Letter units, or graphemes, have been reported in the literature as a surprisingly effective substitute for the more traditional phoneme units, at least in languages that enjoy a strong correspondence between pronunciation and orthography. For English, however, where letter symbols have less acoustic consistency, previously reported results fell short of systems using highly-tuned pronunciation lexicons. Grapheme units simplify system design, but since graphemes map to a wider set of acoustic realizations than phonemes, we should expect grapheme-based acoustic models to require more training data to capture these variations.

In this paper, we compare the rate of improvement of grapheme and phoneme systems trained with datasets ranging from 450 to 1200 hours of speech. We consider various grapheme unit configurations, including using letter-specific, onset, and coda units. We show that the grapheme systems improve faster and, depending on the lexicon, reach or surpass the phoneme baselines with the largest training set.

Index Terms— Acoustic modeling, graphemes, directory assistance, speech recognition.

1. INTRODUCTION

Most large vocabulary speech recognition systems depend on three highly optimized models: a language model that estimates the probability of a sequence of words; a pronunciation model that describes how the words are divided into phoneme units; and an acoustic model that estimates the probability of observing a given acoustic feature vector in a given phonetic context. While the language and acoustic models are typically trained with statistical training algorithms, the pronunciation models tend to be more ad hoc.
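As a point of reference, the three models above combine in the usual Bayes decomposition of large-vocabulary recognition. This is a standard formulation sketched here for context; the notation is not taken from this paper:

```latex
\hat{W} = \arg\max_{W} \; P(W) \sum_{Q} P(Q \mid W)\, P(O \mid Q)
```

where O is the sequence of acoustic feature vectors, P(W) is the language model, P(Q | W) is the pronunciation model mapping the word sequence W to unit sequences Q, and P(O | Q) is the acoustic model. Swapping phoneme units for graphemes changes only the definition of Q: with letter units, P(Q | W) becomes essentially deterministic, and the burden of modeling pronunciation variation shifts entirely to the acoustic model.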
Most commercial systems rely on a combination of a hand-made lexicon for common words and a pronunciation generation engine for words not listed in the lexicon. Often these pronunciations are later refined algorithmically based on acoustic data (e.g. [1]), or revised manually for increased accuracy.

While the language and acoustic models typically can grow and improve with more training data (e.g. more n-grams and longer spans for language models, more states and more Gaussians per state for acoustic models), the pronunciation models often don't scale well with increasing amounts of data. This raises the question of whether it is desirable to keep a pronunciation model when large amounts of training data are available. In a sense, the lexicon provides a data-tying layer between the orthographic and acoustic representations of words, and as data increases, it is possible that this tying becomes unnecessary and may even become a bottleneck.

One could easily build words out of letter-based units, or graphemes, instead of phoneme units, and transform the lexicon generation problem into a purely acoustic training problem. We may then expect common statistical approaches to lead to consistent improvements with increasing amounts of supervised and unsupervised data.

The idea of considering alternatives to phoneme units is not new. More than 20 years ago, Cravero et al. [2] proposed a unit set optimized for consistency and cardinality. Ten years ago, several research groups investigated syllable units, which have the promise of an improved mapping between spelling and acoustics [3, 4, 5, 6].

More recently, and perhaps due to a growing interest in recognizing multiple languages, researchers confronted with the bewildering task of maintaining not one but several lexicons asked the inevitable question "what if we just used letter units instead?" Kanthak et al. [7] and Killer et al.
[8] observed experimentally that for some languages, grapheme systems performed roughly as well as phoneme systems, but that for others, such as English, there was a high error-rate cost to moving to graphemes. The authors attributed this to the poor spelling-to-pronunciation correspondence of the English language, which is another way of observing that, in English, letter units lack acoustic consistency, and that consistency matters, much as Cravero et al. had suggested. But the experiments reported in these papers relied on training sets of roughly tens of hours of speech. If consistency matters, then the amount of data should matter too.

In this paper, we explore the scalability of grapheme systems, i.e. how quickly their performance improves with data, compared to phoneme systems. We base our experiments on data from GOOG-411 [9], an automated system that uses speech recognition and web search to help people call businesses. GOOG-411 is a good test bed for grapheme exper-