Audio–visual speaker identification using dynamic facial movements and utterance phonetic content

Vahid Asadpour a,*, Mohammad Mehdi Homayounpour b, Farzad Towhidkhah c

a Azad University, Mashhad Branch, Biomedical Engineering Department, Ostad Yousefi, Mashhad, Iran
b Amirkabir University of Technology, Computer Engineering Faculty, 424 Hafez, Tehran, Iran
c Amirkabir University of Technology, Biomedical Engineering Faculty, 424 Hafez, Tehran, Iran

* Corresponding author. Tel.: +98 915 101 8024 / 511 6029100. E-mail addresses: asadpour@aut.ac.ir (V. Asadpour), homayoun@aut.ac.ir (M.M. Homayounpour), farzadt@aut.ac.ir (F. Towhidkhah).

Article history: Received 5 September 2007; received in revised form 8 May 2010; accepted 13 July 2010.

Keywords: Biometry; Dynamic face movement; Hill-type muscle model; Adaptive network fuzzy inference systems; Kalman filtering; Auto recursive moving average

Abstract

Robust multimodal identification systems based on audio–visual information have not yet been thoroughly investigated. The aim of this work is to propose a model-based feature extraction method that employs the physiological characteristics of the facial muscles producing lip movements. The approach adopts intrinsic muscle properties such as viscosity, elasticity, and mass, which are extracted from a dynamic lip model. These parameters depend exclusively on the neuro-muscular properties of the speaker; consequently, imitation of valid speakers can be reduced to a large extent. The parameters are applied to a Hidden Markov Model (HMM) audio–visual identification system. In this work, audio and video features are combined by adopting a multistream pseudo-synchronized HMM training method. The proposed model is compared to other feature extraction methods, including Kalman filtering, neural networks, adaptive network fuzzy inference systems (ANFIS), and auto recursive moving average models. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits, together with a phonetically rich sentence. The combination of Kalman filtering with the proposed model led to the best performance. The phonetic content of the pronounced sentences is also evaluated in order to find the phonetic combinations that yield the best identification rate.

1. Introduction

Speaker recognition can be classified into speaker identification and speaker verification. Speaker identification may be "open-set" or "closed-set". In closed-set speaker identification, the test utterance is compared only to the reference models of speakers trained during the training phase; therefore, only misclassification errors can occur. In open-set speaker identification, the test utterance may belong to a speaker not included in the training set, so the identification system must first decide whether the test speaker belongs to the training set (acceptance/rejection) and, if so, identify him/her (a minimal sketch of these two decision rules is given below).

Speaker recognition systems are usually based on one of the text-dependent, text-prompted, or text-independent methods.
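As a concrete illustration of the acceptance/rejection and identification steps outlined above, the following is a minimal sketch and is not taken from the paper: the function names, the per-speaker log-likelihood scores, and the rejection threshold are hypothetical, and in practice the scores would come from the trained speaker models (e.g. HMMs).

```python
import numpy as np

def closed_set_identify(scores):
    """Closed-set identification: the test utterance is always assigned to the
    enrolled speaker whose model scores highest (e.g. an HMM log-likelihood),
    so only misclassification errors are possible."""
    speaker_ids = list(scores.keys())
    log_likelihoods = np.array([scores[s] for s in speaker_ids])
    return speaker_ids[int(np.argmax(log_likelihoods))]

def open_set_identify(scores, threshold):
    """Open-set identification: first accept or reject the hypothesis that the
    test speaker is enrolled at all, then identify on acceptance.
    The threshold is an illustrative tuning parameter."""
    best = closed_set_identify(scores)
    if scores[best] < threshold:
        return None  # rejected: speaker assumed to be outside the enrolled set
    return best

# Hypothetical per-speaker model scores for one test utterance.
scores = {"spk01": -412.7, "spk02": -388.4, "spk03": -405.1}
print(closed_set_identify(scores))        # -> "spk02"
print(open_set_identify(scores, -380.0))  # -> None (rejected)
```

The closed-set rule needs no threshold, while the open-set variant simply adds the acceptance/rejection test before identification.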
Text-dependent systems are the most widely deployed variant and require the input of one or more passwords. Combining a biometric security system with a password provides strong authentication. Text-prompted systems enroll a number of words or numeric sequences; during recognition, the system randomly selects among those words and instructs the claimant to repeat the selected item. The randomness of text-prompting further increases the system's resistance to impostors. The most common methods for text-dependent and text-prompted speaker verification are dynamic time warping, HMMs, and neural networks [1]. Text-independent technology operates on free-flowing speech, and the client is not requested to say the same sentence during each access. Thus, the only information used by the system is the acoustic characteristics of the client. The most common methods for text-independent speaker verification are vector quantization, Euclidean or Mahalanobis distance, and Gaussian mixture models [1].

This paper concerns closed-set text-dependent speaker identification.

Speaker recognition by humans involves the processing of uttered audio and visual information at hierarchical levels. Face perception is mediated by a distributed neural system in humans. This system consists of multiple functional organizations and embodies a distinction between the representation of invariant aspects of faces, which is a basis for recognizing individuals,