Available online at www.sciencedirect.com
Computer Speech and Language 27 (2013) 209–227
Universal attribute characterization of spoken languages for
automatic spoken language recognition
Sabato Marco Siniscalchi a,∗, Jeremy Reed b, Torbjørn Svendsen c, Chin-Hui Lee d
a Faculty of Engineering and Architecture, Kore University of Enna, Cittadella Universitaria, Enna, Sicily, Italy
b Georgia Tech Research Institute, Georgia Institute of Technology, Atlanta, GA 30332, USA
c Department of Electronics and Telecommunications, Norwegian University of Science and Technology, 7491 Trondheim, Norway
d School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
Received 17 November 2011; received in revised form 31 March 2012; accepted 7 May 2012
Available online 23 May 2012
Abstract
We propose a novel universal acoustic characterization approach to spoken language recognition (LRE). The key idea is to describe
any spoken language with a common set of fundamental units that can be defined “universally” across all spoken languages. In
this study, speech attributes, such as manner and place of articulation, are chosen to form this unit inventory and used to build a
set of language-universal attribute models with data-driven modeling techniques. The vector space modeling approach to LRE is
adopted, where a spoken utterance is first decoded into a sequence of attributes independently of its language. Then, a feature vector
is generated by using co-occurrence statistics of manner or place units, and the final LRE decision is implemented with a vector
space language classifier. Several architectural configurations are studied, and the best performance is attained
using a maximal figure-of-merit language classifier. Experimental evidence demonstrates the feasibility of the proposed
technique and shows that it attains performance comparable to that of standard approaches on the LRE tasks
investigated in this work when the same experimental conditions are adopted.
© 2012 Elsevier Ltd. All rights reserved.
Keywords: Spoken language recognition; Vector space model; Latent semantic analysis; Artificial neural network; Support vector machine; Phonetic
features
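The vector space pipeline summarized in the abstract (decode an utterance into an attribute sequence, build a feature vector from co-occurrence statistics, classify in vector space) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the manner inventory `MANNER_UNITS`, the restriction to bigram statistics, and the cosine nearest-centroid classifier, which is only a simple stand-in and not the maximal figure-of-merit classifier studied in the paper. The attribute sequences are assumed to come from a separate, language-independent decoder that is not shown.

```python
from collections import Counter
from itertools import product
import math

# Hypothetical manner-of-articulation inventory (an assumption, not the paper's exact unit set).
MANNER_UNITS = ["fricative", "glide", "nasal", "stop", "vowel", "silence"]
# One feature dimension per ordered pair of units.
BIGRAMS = list(product(MANNER_UNITS, repeat=2))


def cooccurrence_vector(attribute_sequence):
    """Map a decoded attribute sequence to normalized bigram co-occurrence counts."""
    counts = Counter(zip(attribute_sequence, attribute_sequence[1:]))
    total = sum(counts.values())
    if total == 0:  # empty or single-unit sequence has no bigrams
        return [0.0] * len(BIGRAMS)
    return [counts[b] / total for b in BIGRAMS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(labeled_sequences):
    """Average the feature vectors of each language; input is (language, sequence) pairs."""
    sums, n = {}, Counter()
    for lang, seq in labeled_sequences:
        vec = cooccurrence_vector(seq)
        acc = sums.setdefault(lang, [0.0] * len(BIGRAMS))
        for i, x in enumerate(vec):
            acc[i] += x
        n[lang] += 1
    return {lang: [x / n[lang] for x in acc] for lang, acc in sums.items()}


def classify(sequence, centroids):
    """Stand-in vector space classifier: pick the language with the closest centroid."""
    vec = cooccurrence_vector(sequence)
    return max(centroids, key=lambda lang: cosine(vec, centroids[lang]))
```

With six units the feature vector has 36 dimensions, one per ordered unit pair. A place-of-articulation inventory could be handled the same way by swapping the unit list.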
1. Introduction
The process of detecting the presence of a given spoken language in a segment of speech by an unknown speaker
is commonly referred to as language recognition (LRE). Each spoken language has its own distinctive
set of characteristics, referred to as the acoustic signature of the language, which sets it apart from other languages.
This acoustic signature can be discovered using information from multiple sources, such as prosody (Adami and
Hermansky, 2003; Adda-Decker et al., 2003), phonotactic structure (Hazen, 1993; Zissman, 1996), lexical knowledge
(Matrouf et al., 1998), acoustic features (Sugiyama, 1991), and articulatory features (Kirchhoff et al., 2002). In order
to accomplish language recognition automatically, LRE is usually formulated as a pattern recognition problem and
This paper has been recommended for acceptance by ‘Bill Byrne’.
∗ Corresponding author.
E-mail address: marco.siniscalchi@unikore.it (S.M. Siniscalchi).
0885-2308/$ – see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.csl.2012.05.001