SUBMOTIONS FOR HIDDEN MARKOV MODEL BASED DYNAMIC FACIAL ACTION RECOGNITION

Dejan Arsić*, Joachim Schenk*, Björn Schuller, Frank Wallhoff and Gerhard Rigoll
Technische Universität München, Institute for Human Machine Communication
Arcisstrasse 16, 80333 München, Germany
{arsic, schenk, schuller, wallhoff, rigoll}@tum.de

ABSTRACT

Video based analysis of a person's mood or behavior is in general performed by interpreting various features observed on the body. Facial actions such as speaking, yawning or laughing are considered key features. Dynamic changes within the face can be modeled with the well-known Hidden Markov Models (HMMs). Unfortunately, even within one class, examples can show a high variance because of unknown start and end states or the varying length of a facial action. In this work we therefore decompose facial actions into so-called submotions. These can be robustly recognized with HMMs, applying selected points in the face and their geometrical distances. Additionally, the first and second derivatives of the distances are included. A sequence of submotions is then interpreted with a dictionary and dynamic programming, as the order may be crucial. Analyzing the frequency of sequences shows the relevance of the submotion order. In the experimental section we show that our novel submotion approach outperforms a standard HMM with the same set of features by nearly 30% absolute recognition rate.

Index Terms—Dynamic facial expression recognition, Gabor jets, HMMs, submotions

1. INTRODUCTION

On-board security in aircraft cabins can be increased if we know how the passengers behave and what actions they perform during the flight. We aim to implement an automated video surveillance system which raises alerts if unruly behaviors are detected. In [1] we have presented the functionality of such a system, which decomposes a complex behavior into several meaningful predefined indicators (PDIs).
With the help of psychologists and criminal experts in the SAFEE project we decided which PDIs are of major importance on board an aircraft. Among global motion, hand movement, and the use of tools, facial actions seem to be most significant. In contrast to other works we do not focus on facial emotions as defined by Ekman [2], but on the activities laughing, speaking, yawning and other movements, for instance chewing or lip licking. Especially speaking and laughing might be detected better by audio, but in a noisy environment with a large number of people it is not possible to assign a sound to a single person.

In order to recognize PDIs we propose working on image sequences in which the positions of facial features such as mouth, eyes, eyebrows and nose are tracked with Gabor jets [3]. In a following step, geometrical relationships between these features are computed, which results in a mesh over the face. Additionally, the first and second derivatives over time are determined.

In real-world scenarios facial actions do not start from a fixed state and do not necessarily have constant transitions within the motion. The class of the facial activity may also change within as little as 3 frames. In order to cope with these constraints, we propose splitting up a facial activity into several so-called submotions, similar to phonemes in speech recognition [4]. These are able to describe transitions between different facial states. For instance, yawning may be described by opening the mouth, reaching a wide open mouth position, keeping it open and closing it slowly. Various different descriptions may be found for each activity. Therefore we suggest recognizing submotions with Hidden Markov Models, followed by a subsequent classification of the HMM output.

* Both authors contributed to this work equally. This work has partially been funded by the European Union within the FP6 IST SAFEE Project. Special thanks to Thomas Mikschl for implementing the feature extraction.
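The observation vectors described above — pairwise distances over the tracked points plus their first and second temporal derivatives — can be sketched as follows. This is a minimal illustration, assuming the tracked points arrive as per-frame (x, y) arrays; the function names and array shapes are our own and not taken from the paper:

```python
import numpy as np

def geometric_distances(points):
    """Pairwise Euclidean distances between tracked facial points.

    points: array of shape (T, N, 2) -- T frames, N tracked (x, y) positions.
    Returns an array of shape (T, N*(N-1)//2) holding the upper-triangle
    distances, i.e. the 'mesh' of geometrical relationships per frame.
    """
    T, N, _ = points.shape
    iu, ju = np.triu_indices(N, k=1)          # index pairs (i, j), i < j
    diffs = points[:, iu, :] - points[:, ju, :]
    return np.linalg.norm(diffs, axis=-1)

def add_derivatives(features):
    """Append first and second temporal derivatives (delta, delta-delta)."""
    d1 = np.gradient(features, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.concatenate([features, d1, d2], axis=1)

# Example: 20 MPEG-4 feature points tracked over 50 frames.
obs = add_derivatives(geometric_distances(np.random.rand(50, 20, 2)))
# Per-frame observation vectors for the HMMs: 20 points yield 190 distances,
# so with both derivatives each frame has 3 * 190 = 570 components.
```

With 20 feature points this yields 570-dimensional frame-wise observations, which would then feed the submotion HMMs.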
We will present results of a frequency based approach and a distance measure which considers the order of appearance of the submotions. Both approaches are able to compensate for errors made during the submotion recognition task and enhance the recognition results.

In this work we present classification results with submotions on a behavior database simulating events in an aircraft. Furthermore, recognition of complete actions without further decomposition has also been performed with HMMs, to evaluate the increase in performance after decomposition. The advantage of dynamic classification is shown by comparison to static classification with Support Vector Machines.

2. FACIAL FEATURES

In order to achieve high precision and keep computation times as low as possible, we decided to work with a small set of meaningful feature points in the face. This way we disregard a large part of the face and reduce the required amount of data. Based on the physiology of the face, the MPEG-4 standard defines feature points relevant for facial expression [5]. Out of the set of given points, 20 which may be detected automatically have been chosen to describe faces. These are illustrated in figure 1 on the left side. Their raw coordinates do not generalize faces in a person-independent way: the size of eyes, mouth and eyebrows varies from person to person, and the face's orientation is not considered. In a first step, the size of the faces is therefore normalized and they are rotated into an upright position. To this end the angle between the eyes is computed, as they are usually at the same height. Afterwards the images are scaled to a common size by aligning the distance between the two eyes.

Though the data is normalized, there is still a large person dependency as well as variance between the different faces. Therefore we

ICIP 2006, IEEE
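The normalization steps above — rotating by the angle of the eye line and scaling by the inter-eye distance — can be sketched as follows. This is an illustrative sketch under our own assumptions (eye centers already detected, target eye distance chosen arbitrarily), not the paper's implementation:

```python
import numpy as np

def normalize_face(points, left_eye, right_eye, target_eye_dist=60.0):
    """Rotate facial feature points upright and scale to a common size.

    points: (N, 2) array of feature coordinates in one frame.
    left_eye, right_eye: detected (x, y) eye centers (assumed given).
    The rotation angle is taken from the line connecting the eyes, which
    should be horizontal; scaling aligns the inter-eye distance.
    """
    dx, dy = np.subtract(right_eye, left_eye)
    angle = np.arctan2(dy, dx)                 # in-plane head roll
    c, s = np.cos(-angle), np.sin(-angle)
    R = np.array([[c, -s], [s, c]])            # rotation back to upright
    centered = np.asarray(points) - np.asarray(left_eye)
    upright = centered @ R.T                   # eyes now on the same height
    scale = target_eye_dist / np.hypot(dx, dy) # align inter-eye distance
    return upright * scale
```

After this step the eye line is horizontal and the eyes are a fixed distance apart, so the remaining variation stems from person-specific face shape rather than pose or image scale.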