Please cite this article in press as: V. Asadpour, et al., Audio–visual speaker identification using dynamic facial movements and utterance
phonetic content, Appl. Soft Comput. J. (2010), doi:10.1016/j.asoc.2010.07.007
Audio–visual speaker identification using dynamic facial movements and
utterance phonetic content
Vahid Asadpour a,∗, Mohammad Mehdi Homayounpour b,1, Farzad Towhidkhah c,2
a Azad University, Mashhad Branch, Biomedical Engineering Department, Ostad Yousefi, Mashhad, Iran
b Amirkabir University of Technology, Computer Engineering Faculty, 424 Hafez, Tehran, Iran
c Amirkabir University of Technology, Biomedical Engineering Faculty, 424 Hafez, Tehran, Iran
article info
Article history:
Received 5 September 2007
Received in revised form 8 May 2010
Accepted 13 July 2010
Available online xxx
Keywords:
Biometry
Dynamic face movement
Hill-type muscle model
Adaptive network fuzzy inference system
Kalman filtering
Autoregressive moving average
abstract
Robust multimodal identification systems based on audio–visual information have not yet been thoroughly
investigated. The aim of this work is to propose a model-based feature extraction method that employs
the physiological characteristics of the facial muscles producing lip movements. The approach uses
intrinsic muscle properties such as viscosity, elasticity, and mass, which are extracted from a dynamic
lip model. Because these parameters depend exclusively on the neuro-muscular properties of the
speaker, imitation of valid speakers can be reduced to a large extent. The parameters are fed to a
Hidden Markov Model (HMM) audio–visual identification system. In this work, audio and video features
are combined using a multistream pseudo-synchronized HMM training method. The proposed model
is compared to other feature extraction methods, including Kalman filtering, neural networks, the
adaptive network fuzzy inference system (ANFIS), and autoregressive moving average models. The
superior performance of the proposed system is demonstrated on a large multispeaker database of
continuously spoken digits, along with a phonetically rich sentence. The combination of Kalman filtering
and the proposed model led to the best performance. The phonetic content of the pronounced sentences
is also evaluated to find the phonetic combinations that yield the best identification rate.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
Speaker recognition can be classified into speaker identification
and speaker verification. Speaker identification may be “open-set”
or “closed-set”. In closed-set speaker identification, the test utterance
is compared only to the reference models of those speakers whose
models were trained during the training phase; consequently, only
misclassification errors can occur. In open-set speaker identification,
the test utterance may belong to a speaker not included in the
training set, so the identification system must first decide whether
the test speaker belongs to the training set (acceptance/rejection)
and, if the speaker does belong to it, identify him/her.
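The distinction above can be summarized in the standard maximum-likelihood formulation (the notation here is ours, not the paper's): the closed-set system returns the enrolled speaker whose model best explains the test utterance,

\[
\hat{s} = \arg\max_{1 \le s \le S} \, p(X \mid \lambda_s),
\]

where \(X\) denotes the feature sequence of the test utterance and \(\lambda_s\) the trained model of speaker \(s\) among the \(S\) enrolled speakers. Open-set identification adds an acceptance step, for example rejecting the claim whenever \(\max_s p(X \mid \lambda_s)\) falls below a decision threshold.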
∗ Corresponding author. Tel.: +98 915 101 8024/511 6029100.
E-mail addresses: asadpour@aut.ac.ir (V. Asadpour), homayoun@aut.ac.ir (M.M. Homayounpour), farzadt@aut.ac.ir (F. Towhidkhah).
1 Tel.: +98 2164542722.
2 Tel.: +98 2164542363.
Speaker recognition systems are usually based on one of three
approaches: text-dependent, text-prompted, or text-independent.
Text-dependent systems, the most widely deployed variant,
require the input of one or more passwords. Combining a
biometric security system with a password provides
strong authentication. Text-prompted systems enroll a number of
words or numeric sequences; during recognition, the system
randomly selects among those items and instructs the claimant to
repeat the selected one. The randomness of text prompting further
increases a system’s resistance to impostors. The most common
methods for text-dependent and text-prompted speaker verification
are dynamic time warping, HMMs, and neural networks [1].
Text-independent technology operates on free-flowing speech, and
the client is not requested to say the same sentence during each
access; thus, the only information used by the system is the
acoustic characteristics of the client. The most common methods
for text-independent speaker verification are vector quantization,
Euclidean or Mahalanobis distance, and Gaussian mixture models
[1]. This paper concerns closed-set text-dependent speaker
identification.
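To make the last of these baselines concrete, the following is a minimal sketch (not the system proposed in this paper) of closed-set identification with one Gaussian mixture model per enrolled speaker; the speaker names and two-dimensional "feature" data are synthetic stand-ins for real cepstral features:

```python
# Sketch: closed-set speaker identification via per-speaker GMMs.
# Each speaker is enrolled by fitting a GMM to that speaker's features;
# a test utterance is assigned to the speaker whose model scores it highest.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic enrollment features for two well-separated speakers.
train = {
    "spk_a": rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    "spk_b": rng.normal(loc=5.0, scale=1.0, size=(200, 2)),
}

# Enrollment: one GMM per speaker, fitted on that speaker's features only.
models = {
    spk: GaussianMixture(n_components=2, random_state=0).fit(feats)
    for spk, feats in train.items()
}

def identify(test_feats):
    """Return the enrolled speaker whose GMM gives the highest
    average log-likelihood for the test utterance."""
    return max(models, key=lambda spk: models[spk].score(test_feats))

# A test utterance drawn from the same distribution as spk_b.
test_utt = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
print(identify(test_utt))  # → spk_b
```

The same argmax-over-model-scores structure underlies the HMM-based text-dependent system studied in this paper, with the GMMs replaced by per-speaker HMMs scored on the audio–visual feature streams.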
Speaker recognition by humans involves processing uttered audio
and visual information at hierarchical levels. Face perception is
mediated by a distributed neural system in humans. This system
comprises multiple functional organizations that embody a
distinction between the representation of invariant aspects of
faces, which is a basis for recognizing individuals,