AN EFFICIENT FRAMEWORK FOR ROBUST MOBILE SPEECH RECOGNITION SERVICES

R. C. Rose, I. Arizmendi, and S. Parthasarathy
AT&T Labs – Research, Florham Park, NJ 07932
{rose,iker,sps}@research.att.com

ABSTRACT

A distributed framework for implementing automatic speech recognition (ASR) services on wireless mobile devices is presented. The framework is shown to scale easily to support a large number of mobile users connected over a wireless network and to degrade gracefully under peak loads. The importance of robust acoustic modeling techniques is demonstrated for situations where the use of specialized acoustic transducers on the mobile devices is not practical. It is shown that unsupervised acoustic normalization and adaptation techniques can reduce speech recognition word error rate (WER) by 30 percent. It is also shown that an unsupervised paradigm for updating and applying these robust modeling algorithms can be efficiently implemented within the distributed framework.

1. INTRODUCTION

This paper describes and evaluates a distributed ASR framework for mobile ASR services. The framework is evaluated in terms of its ability to support a large number of simulated clients simultaneously using a limited set of ASR decoders. The framework currently supports directory retrieval ASR applications for users of Compaq iPAQ mobile devices over an IEEE 802.11 wireless local area network [5]. An experimental study is presented demonstrating the effect of unsupervised speaker and environment compensation algorithms in improving ASR performance when user utterances are spoken through the standard iPAQ device-mounted microphone.

There are a large number of applications for mobile devices that include automatic speech recognition (ASR) as a key component of the user interface. These include multimodal dialog applications [3], voice form-filling applications [5], and value-added applications that provide short-cuts to user interface functions.
Speech recognition is generally just one part of a multi-modal dialog architecture for these mobile applications, whose functional components can be distributed in different ways between computing resources residing in the network and on the mobile device.

While a range of distributed ASR architectures have been proposed for these applications, one can make qualitative arguments for when either fully embedded or network-based ASR implementations are most appropriate. Fully embedded implementations are generally thought to be most appropriate for value-added applications like name dialing or digit dialing, largely because no network connectivity is necessary when ASR is implemented locally on the device [6]. Distributed or network-based ASR implementations are considered appropriate for ASR-based services that require access to large application-specific databases, where issues of database security and integrity make it impractical to distribute representations of the database to all devices [5]. Network-based implementations also facilitate porting the application to multiple languages and multiple applications without having to effect changes to the individual devices in the network.

Acoustic variability in mobile domains is considered here to be a very important problem that distinguishes ASR in mobile domains from generic ASR domains. The main issue is that users of mobile devices will be using them in a wider variety of continuously varying acoustic environments, making the expected conditions far different from those in wire-line telephone or desktop applications. However, the use of personalized devices and personalized services facilitates a new paradigm for implementing robust algorithms. Speaker, channel, and environment representations can be acquired through normal use of the device, all of which can be applied to feature-space and model-space transformations in ASR.
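As a concrete illustration of a feature-space normalization of the general kind discussed here, the sketch below applies per-utterance cepstral mean and variance normalization (CMVN), a standard technique for compensating stationary channel effects. This is illustrative only and is not the specific algorithm evaluated in Section 3; the function name and array shapes are assumptions.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    features: (num_frames, num_coeffs) array of cepstral coefficients.
    Returns the features shifted to zero mean and scaled to unit
    variance along the time axis, which removes a stationary
    channel/environment offset from each cepstral dimension.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Example: normalize a synthetic utterance of 100 frames of 13 cepstra.
utt = np.random.default_rng(0).normal(loc=2.0, scale=3.0, size=(100, 13))
norm = cmvn(utt)
```

In a distributed framework such as the one described here, the per-speaker statistics (the mean and variance above) could be accumulated in the network across sessions rather than per utterance, which is what makes the unsupervised, usage-driven paradigm attractive.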
The feature-domain speaker normalization/transformation algorithms described in Section 3 are applied and evaluated under this paradigm.

The paper is composed of two major parts. The first part, given in Section 2, presents a description of the framework along with simulations demonstrating the ability of the framework to scale to a large number of clients. The second part, given in Section 3, discusses the implementation of speaker-specific feature-space normalizations and transformations from user state information acquired and stored by the software framework in the network. The results of the simulations are summarized in Section 4.

2. MOBILE ASR FRAMEWORK

Modern multi-user applications are often challenged by the need to scale to a potentially large number of users while minimizing the degradation in service response even