Sign Language Recognition using Sequential Pattern Trees

Eng-Jon Ong, Helen Cooper, Nicolas Pugeault, Richard Bowden
The Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, Surrey, UK
{e.ong,h.cooper,n.pugeault,r.bowden}@surrey.ac.uk

Abstract

This paper presents a novel, discriminative, multi-class classifier based on Sequential Pattern Trees. It is efficient to learn, compared to other Sequential Pattern methods, and scalable for use with large classifier banks. For these reasons it is well suited to Sign Language Recognition. Using deterministic, robust features based on hand trajectories, sign-level classifiers are built from sub-units. Results are presented both on a large-lexicon, single-signer data set and on a multi-signer Kinect™ data set. In both cases it is shown to outperform the non-discriminative Markov model approach and to be equivalent to previous, more costly, Sequential Pattern (SP) techniques.

1. Introduction

This paper attempts to tackle the problem of independent sign-language recognition. Sign Language, being as complex as any spoken language, has many thousands of signs, each differing from the next by minor changes in hand motion, shape or position. Its grammar includes the modification of signs to indicate an adverb modifying a verb, and the concept of placement, where objects or people are given a spatial position and then referred to later. This, coupled with intra-signer differences, makes true Sign Language Recognition (SLR) an intricate challenge.

Previous SLR work has shown the advantage of using tracking-based, sub-unit classifiers [6], while others have shown results on larger datasets using data-driven approaches. Wang et al. created an American Sign Language (ASL) dictionary based on similarity between signs using a Dynamic Space-Time Warping (DSTW) approach.
They used an exemplar, sign-level approach and did not use Hidden Markov Models (HMMs) due to the high quantities of training data required. They present results for a dictionary containing 1113 signs [12]. More recently, Pitsikalis et al. [9] proposed a method which uses linguistic labelling to split signs into sub-units. From this they learn signer-specific models, which are then combined via HMMs to create a classifier over 961 signs. The common requirement of tracking means that the Kinect™ offers the sign recognition community a short-cut to real-time performance. Zafrulla et al. used this to extend their previous CopyCat game for deaf children [13]. Using a cross-validated system, they trained HMMs to recognise signs.

One disadvantage of HMMs is that they are learnt in a non-discriminative fashion. As a result, during the learning process, data from other classes are ignored. This can result in sub-optimal classifiers, particularly when there are large ambiguities between different classes. Additionally, HMMs do not perform explicit feature selection. As a result, features that do not contribute to, or are detrimental to, the recognition process are always included.

To address the above issues, we consider discriminative spatio-temporal patterns for classification called sequential patterns (SPs). SPs are sequences of feature subsets. Using SPs provides the advantage of explicit spatio-temporal feature selection. Additionally, SPs do not require dynamic time warping for temporal alignment. A set of SPs can also be stored and later used efficiently within a tree structure.

Research in SPs mainly lies in the area of data mining, where the objective is finding frequent SPs. Related to our work is Ayres et al. [1], where SPs are organised into a prefix tree, which is then used to mine frequent patterns; the tree itself is discarded upon completion of the mining process. Later, Hong et al.
[5] proposed a method for mining frequent patterns by incrementally updating a tree structure based on shared subsequences between frequent SPs. It should be noted that [1, 5] do not consider the problem of classification, only the mining of SPs that frequently occur over a set of unlabelled examples. The use of SPs for classification of signs was recently proposed by Elliott et al. [4], where SPs were learnt in a discriminative fashion before being combined into strong classifiers for recognising signs. One major drawback of using SPs for classification is that they only permit binary classifiers to be built. For problems with more than two classes, it is necessary to employ 1vs1 classifiers within a voting framework, a method that does not scale well with the number of classes.
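
To make the two ideas above concrete, the following is a minimal Python sketch, not the implementation used in this paper. It assumes the usual data-mining notion of containment: an SP is present in an input sequence if its feature subsets can be matched, in order, to (not necessarily adjacent) frames whose active features include them. This is why no explicit time warping is needed. The function names `contains_pattern` and `one_vs_one_count` and the feature labels are hypothetical, chosen only for illustration.

```python
from typing import FrozenSet, Sequence

# A frame is the set of binary features that fire at one time step.
# (Assumed representation; feature names below are illustrative only.)
FeatureSet = FrozenSet[str]


def contains_pattern(sequence: Sequence[FeatureSet],
                     pattern: Sequence[FeatureSet]) -> bool:
    """Return True if `pattern` occurs in `sequence` as an ordered
    subsequence of feature subsets (gaps between matches are allowed)."""
    pos = 0
    for wanted in pattern:
        # Greedily advance to the earliest frame containing this subset.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False  # ran out of frames before matching this element
        pos += 1  # the next pattern element must match a strictly later frame
    return True


def one_vs_one_count(num_classes: int) -> int:
    """Number of binary classifiers needed by a 1vs1 voting scheme."""
    return num_classes * (num_classes - 1) // 2


frames = [
    frozenset({"hands_apart"}),
    frozenset({"hands_apart", "right_hand_up"}),
    frozenset({"right_hand_up"}),
    frozenset({"hands_together"}),
]
pattern = [frozenset({"hands_apart"}), frozenset({"hands_together"})]

print(contains_pattern(frames, pattern))        # pattern in temporal order
print(contains_pattern(frames, pattern[::-1]))  # same subsets, reversed order
print(one_vs_one_count(1113))                   # lexicon size reported in [12]
```

The pairwise count grows quadratically with the number of classes, which illustrates why a 1vs1 voting framework becomes costly for lexicons of around a thousand signs.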