Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2011, Article ID 294010, 14 pages
doi:10.1155/2011/294010
Research Article
On the Soft Fusion of Probability Mass Functions for
Multimodal Speech Processing
D. Kumar, P. Vimal, and Rajesh M. Hegde
Department of Electrical Engineering, Indian Institute of Technology, Kanpur 208016, India
Correspondence should be addressed to Rajesh M. Hegde, rhegde@iitk.ac.in
Received 25 July 2010; Revised 8 February 2011; Accepted 2 March 2011
Academic Editor: Jar Ferr Yang
Copyright © 2011 D. Kumar et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Multimodal speech processing has been investigated as a means of increasing the robustness of unimodal speech processing systems.
Hard fusion of acoustic and visual speech is generally used for improving the accuracy of such systems. In this paper, we discuss the
significance of two soft belief functions developed for multimodal speech processing. These soft belief functions are formulated on
the basis of a confusion matrix of probability mass functions obtained jointly from both acoustic and visual speech features. The
first soft belief function (BHT-SB) is formulated for binary hypothesis testing (BHT) like problems in speech processing. This approach
is extended to multiple hypothesis testing (MHT) like problems to formulate the second belief function (MHT-SB). The two
soft belief functions, namely, BHT-SB and MHT-SB, are applied to the speaker diarization and audio-visual speech recognition
tasks, respectively. Experiments on speaker diarization are conducted on meeting speech data collected in a lab environment and
also on the AMI meeting database. Audio-visual speech recognition experiments are conducted on the GRID audio-visual corpus.
Experimental results are obtained for both multimodal speech processing tasks using the BHT-SB and the MHT-SB functions. The
results indicate reasonable improvements when compared to unimodal (acoustic speech or visual speech alone) speech processing.
1. Introduction
Multimodal speech content is primarily composed of acoustic
and visual speech [1]. Classifying and clustering multimodal
speech data generally requires extraction and combination of
information from these two modalities [2]. The streams
constituting multimodal speech content are naturally different
in terms of scale, dynamics, and temporal patterns. These
differences make it difficult to combine the information
sources using classical combination techniques.
Information fusion [3] can be broadly classified into sensor-level
fusion, feature-level fusion, score-level fusion, rank-level
fusion, and decision-level fusion. A hierarchical block
diagram of these fusion levels is shown in Figure 1.
A number of techniques are available for audio-visual
information fusion; these can be broadly grouped into feature
fusion and decision fusion. The former class of methods
is the simpler of the two, as it is based on training a traditional
HMM classifier on the concatenated vector of the acoustic
and visual speech features, or on an appropriate transformation
of it. Decision fusion methods combine the single-modality
(audio-only and visual-only) HMM classifier outputs to
recognize audio-visual speech [4, 5]. Specifically, class
conditional log-likelihoods from the two classifiers are
linearly combined using appropriate weights that capture the
reliability of each classifier, or feature stream. This likelihood
recombination can occur at various levels of integration,
such as the state, phone, syllable, word, or utterance level.
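At the utterance level, for instance, the fused score for a class $c$ commonly takes the form

$$\log P(O_{AV} \mid c) = \lambda \log P(O_A \mid c) + (1 - \lambda) \log P(O_V \mid c),$$

where $\lambda \in [0, 1]$ weights the relative reliability of the acoustic stream. This is the standard formulation of weighted likelihood recombination; the symbol $\lambda$ is introduced here for illustration rather than taken from a specific system.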
However, two of the most widely applied fusion schemes
in multimodal speech processing are concatenative feature
fusion (early fusion) and coupled hidden Markov models
(late fusion).
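To make the contrast concrete, the following minimal Python sketch illustrates both schemes on synthetic data: early fusion concatenates the per-frame acoustic and visual feature vectors into a single joint vector, while late fusion linearly combines per-class log-likelihoods from two single-modality classifiers. This is an illustrative sketch, not the authors' implementation; the feature dimensions, number of classes, and weight value are assumptions, and the randomly generated scores stand in for real classifier outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-frame features: 39-dim acoustic (e.g., MFCCs)
# and 30-dim visual vectors at one time instant (assumed sizes).
O_A = rng.standard_normal(39)   # acoustic feature vector
O_V = rng.standard_normal(30)   # visual feature vector

# --- Early (concatenative feature) fusion ------------------------------
# A single classifier would be trained on the joint vector [O_A; O_V].
O_AV = np.concatenate([O_A, O_V])   # joint audio-visual feature

# --- Late (decision) fusion ---------------------------------------------
# Assume each single-modality classifier returns class-conditional
# log-likelihoods log p(O_s | c) for classes c = 1..C; faked here.
C = 5
log_lik_A = rng.standard_normal(C)   # audio-only classifier scores
log_lik_V = rng.standard_normal(C)   # visual-only classifier scores

# Reliability weight lambda in [0, 1]; larger values trust audio more.
lam = 0.7
fused = lam * log_lik_A + (1.0 - lam) * log_lik_V

decision = int(np.argmax(fused))
print(f"joint feature dim: {O_AV.shape[0]}, fused decision: class {decision}")
```

In practice the weight would be tuned on held-out data to track the reliability of each stream, for example lowering it as the acoustic signal-to-noise ratio degrades.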
1.1. Feature Level Fusion. In the concatenative feature fusion
scheme [6], feature vectors obtained from audio and
video modalities are concatenated and the concatenated
vector is used as a single feature vector. Let the time-synchronous
acoustic and visual speech features at instant $t$ be denoted by
$O_s^{(t)} \in \mathbb{R}^{D_s}$, where $D_s$ is the dimensionality
of the feature vector and $s = A, V$ for the audio
and video modalities, respectively. The joint audio-visual