Computer Speech and Language 27 (2013) 135–150

Speaker state recognition using an HMM-based feature extraction method

R. Gajšek, F. Mihelič, S. Dobrišek
University of Ljubljana, Faculty of Electrical Engineering, Tržaška cesta 25, 1000 Ljubljana, Slovenia

Received 1 May 2011; received in revised form 29 December 2011; accepted 18 January 2012
Available online 2 February 2012

Abstract

In this article we present an efficient approach to modeling acoustic features for the task of recognizing various paralinguistic phenomena. Instead of the standard scheme of adapting a Universal Background Model (UBM) represented by a Gaussian Mixture Model (GMM), which is normally used to model frame-level acoustic features, we propose to represent the UBM by building a monophone-based Hidden Markov Model (HMM). We present two approaches: transforming the monophone-based, segmented HMM–UBM into a GMM–UBM and proceeding with the standard adaptation scheme, or performing the adaptation directly on the HMM–UBM. Both approaches give superior results to the standard adaptation scheme (GMM–UBM) on both the emotion recognition task and the alcohol detection task. Furthermore, with the proposed method we were able to achieve better results than the current state-of-the-art systems in both tasks.

© 2012 Elsevier Ltd. All rights reserved.

Keywords: Emotion recognition; Intoxication recognition; Hidden Markov Models; Universal Background Model; Model adaptation

1. Introduction

Augmenting a human–computer interaction (HCI) system with various paralinguistic recognition capabilities has recently gained a lot of attention from the speech processing community. As stated by Cowie et al. (2001), speech communication between humans can be split into two channels, one transmitting explicit information and the other transmitting implicit information. The explicit channel represents "what" is being said and has been studied for years through the development of speech recognition systems. The implicit channel represents "how" it is being said and consists of different phenomena such as emotions, gender, age, stress, identity, etc. While some of these (identity, gender) have been studied in the past, others, such as emotions or stress, have been somewhat neglected. However, to ensure that communication with artificial systems is perceived by humans as natural, the implicit channel of communication needs to be incorporated as well. Furthermore, the system's dialog manager could benefit greatly from added information about the user's state, such as age, gender, or emotional state, when deciding how to react to the user's command. In certain circumstances, even other knowledge about the speaker state