Exploiting Context-Dependency and Acoustic Resolution of Universal Speech Attribute Models in Spoken Language Recognition

Sabato Marco Siniscalchi 1, Jeremy Reed 2, Torbjørn Svendsen 3, and Chin-Hui Lee 2

1 Department of Telematics, University of Enna “Kore”, Enna, Italy
2 School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
3 Department of Electronics and Telecommunications, NTNU, Trondheim, Norway

marco.siniscalchi@unikore.it, jeremy.reed@gatech.edu, torbjorn@iet.ntnu.no, chl@ece.gatech.edu

Abstract

This paper expands a previously proposed universal acoustic characterization approach to spoken language identification (LID) by studying different ways of modeling attributes to improve language recognition. The motivation is to describe any spoken language with a common set of fundamental units. Thus, a spoken utterance is first tokenized into a sequence of universal attributes. Then a vector space modeling approach delivers the final LID decision. Context-dependent attribute models are now used to better capture spectral and temporal characteristics. Also, an approach to expand the set of attributes to increase the acoustic resolution is studied. Our experiments show that the tokenization accuracy positively affects LID results, producing a 2.8% absolute improvement over our previous 30-second NIST 2003 performance. This result also compares favorably with the best results known to the authors on the same task when the tokenizers are trained on language-dependent OGI-TS data.

Index Terms: language identification, latent semantic analysis.

1. Introduction

LID is the process of identifying the language spoken in a sample of speech by an unknown speaker. Each language has its own unique set of characteristics, referred to as the acoustic signature of the language, which distinguishes it from any other language.
This acoustic signature can be discovered using information from multiple sources, such as prosody, phonotactic structure, lexical knowledge, acoustic features, vocabulary, and articulatory features. Spectral- and token-based approaches are the statistical techniques usually adopted to decode the acoustic signature of a language. Spectral-based approaches try to determine the language of a spoken query by exploiting only acoustic cues (e.g., [1]). Token-based approaches exploit linguistic properties in addition to acoustic information. For example, the phone recognition followed by language modeling (PRLM) [2] approach uses a set of phone models (tokenizer) to convert each speech utterance into a language-dependent string of units (tokens). Multiple interpolated n-gram language-dependent models then allow higher-order statistics to guide the decision. Parallel phone recognition followed by language modeling (PPRLM) [2] is a successful extension of PRLM.

The token-based paradigm has consistently delivered superior results over spectral-based methods on the NIST Language Recognition Evaluation (LRE) tasks [1]. Unfortunately, the token-based paradigm also suffers from two main drawbacks. First, training a tokenizer requires labeled data, which is difficult to obtain for rarely observed languages or for languages without an orthography and a well-documented phonetic dictionary. Second, the decoding phase is usually computationally intensive. To overcome these issues, several authors have proposed LID systems based on language-independent (or universal) acoustic phone models, e.g., [3, 4, 5]. However, the combined phone list generated from the limited set of initial languages usually does not cover new and rarely seen languages. In [6], the authors address this issue by defining a set of universal acoustic segment models (ASMs) that characterize all spoken languages; however, a good characterization requires hundreds of ASMs.
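To make the PRLM decision rule concrete, the scoring step can be sketched as follows. This is a minimal illustration, not the models of [2]: the token inventory, the single training string per language, and the add-one smoothing are all simplifying assumptions made for the sketch.

```python
from collections import Counter
import math

def bigrams(tokens):
    """Pad a token sequence with boundary symbols and enumerate its bigrams."""
    padded = ["<s>"] + tokens + ["</s>"]
    return list(zip(padded, padded[1:]))

class BigramLM:
    """Add-one-smoothed bigram model over a small token inventory."""
    def __init__(self, training_sequences, vocab):
        self.vocab_size = len(vocab) + 2  # + boundary symbols
        self.bigram_counts = Counter()
        self.history_counts = Counter()
        for seq in training_sequences:
            for h, t in bigrams(seq):
                self.bigram_counts[(h, t)] += 1
                self.history_counts[h] += 1

    def log_prob(self, seq):
        lp = 0.0
        for h, t in bigrams(seq):
            num = self.bigram_counts[(h, t)] + 1
            den = self.history_counts[h] + self.vocab_size
            lp += math.log(num / den)
        return lp

# Toy token inventory (hypothetical manner-of-articulation labels plus silence).
VOCAB = ["vowel", "stop", "fricative", "nasal", "approximant", "sil"]

# One invented training decoding per language; a real system uses many utterances.
lang_models = {
    "A": BigramLM([["sil", "stop", "vowel", "nasal", "vowel", "sil"]], VOCAB),
    "B": BigramLM([["sil", "fricative", "vowel", "stop", "sil"]], VOCAB),
}

# PRLM decision: tokenize the test utterance, then pick the language whose
# n-gram model assigns the token string the highest likelihood.
test_tokens = ["sil", "stop", "vowel", "nasal", "vowel", "sil"]
best = max(lang_models, key=lambda l: lang_models[l].log_prob(test_tokens))
```

Here `best` is language "A", whose training string matches the test decoding; in PPRLM the same scoring is repeated over several parallel tokenizers and the scores are fused.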
Recently, we have presented a novel vector space modeling (VSM) approach to LID [7] that characterizes languages using tokens based on articulatory features. Within this framework, a complete characterization of spoken documents can be obtained using only 15 tokens (5 manner of articulation tokens, 9 place of articulation tokens, and the silence token), compared to the hundreds of acoustic segments used in the ASM-based approach.

In this paper, we extend our VSM-based LID approach by designing better attribute recognizers (tokenizers). Other authors have already shown that the tokenization accuracy directly affects LID results, but this accuracy was improved simply by increasing the amount of labeled training data. Conversely, the key ideas of the proposed approach are to use context-dependent (CD) attribute models, namely right-context (RC) dependent models, and to increase the acoustic resolution of the tokenizers. RC subword models better capture spectral and temporal information in diverse contexts, which produces a more accurate tokenization. Experimental results show that the attribute tokenization error rate decreased from 27.99% to 24.5% and from 57.07% to 36.45% for manner and place, respectively. The set of manner attributes is expanded from five to nine as a first attempt at increasing the acoustic resolution of the manner tokenizer. A final equal error rate (EER) of 8.5% is attained on the 30-second NIST 2003 task when the universal tokenizer is trained on the OGI Multi-language Telephone Speech (OGI-TS) 1 database. This result represents an absolute error reduction of 2.8% with respect to our previously reported performance [7]. To the best of the authors' knowledge, the reported performance also outperforms the best language-dependent PRLM systems trained on the language-specific OGI-TS database and tested on the same task [8].

2. Universal Speech Attribute LID System

In [7], we have demonstrated that the LID task can be effectively addressed with the system shown in Figure 1. This LID system consists of two main blocks: a front-end, shown on the left side of Figure 1, and a back-end, shown on the right side of Figure 1. The front-end implements a universal attribute recognizer (UAR) that decodes a spoken utterance into two sequences

1 http://cslu.cse.ogi.edu/corpora/corpCurrent.html/

Copyright 2010 ISCA. INTERSPEECH 2010, 26-30 September 2010, Makuhari, Chiba, Japan. 10.21437/Interspeech.2010-720
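The vector space view of the back-end described in [7] can be roughly illustrated as follows. This is a toy sketch under simplifying assumptions: the attribute strings are invented, each language is represented by a single training utterance, and the decision is made by raw cosine similarity, whereas the actual system applies latent semantic analysis style term weighting and a trained classifier.

```python
from collections import Counter
import math

def ngram_vector(tokens, n=2):
    """Bag-of-n-grams 'term' vector for one tokenized spoken document."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical manner-attribute decodings for two training languages.
train = {
    "lang1": ["sil", "stop", "vowel", "nasal", "vowel", "sil"],
    "lang2": ["sil", "fricative", "vowel", "fricative", "stop", "sil"],
}
lang_vectors = {lang: ngram_vector(seq) for lang, seq in train.items()}

# A test utterance is mapped into the same vector space by the tokenizer,
# and the closest language vector gives the LID decision.
test_seq = ["sil", "stop", "vowel", "nasal", "sil"]
test_vec = ngram_vector(test_seq)
decision = max(lang_vectors, key=lambda lang: cosine(test_vec, lang_vectors[lang]))
```

With these invented strings the decision is "lang1", since the test utterance shares attribute bigrams only with that language; the small 15-token inventory keeps such vectors compact even for high n-gram orders.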