SPATIO-TEMPORAL MID-LEVEL FEATURE BANK FOR ACTION RECOGNITION IN LOW QUALITY VIDEO

Saimunur Rahman, John See
Centre of Visual Computing, Faculty of Computing and Informatics
Multimedia University, Cyberjaya 63100, Malaysia

ABSTRACT

It is a great challenge to perform high-level recognition tasks on videos that are poor in quality. In this paper, we propose a new spatio-temporal mid-level (STEM) feature bank for recognizing human actions in low quality videos. The feature bank comprises a trio of local spatio-temporal features, i.e. shape, motion and texture, which respectively encode structural, dynamic and statistical information in video. These features are encoded into mid-level representations and aggregated to construct STEM. Based on the recent binarized statistical image feature (BSIF), we also design a new spatio-temporal textural feature that extracts discriminative information from 3D salient patches. Extensive experiments on the poor quality versions/subsets of the KTH and HMDB51 datasets demonstrate the effectiveness of the proposed approach.

Index Terms— Action recognition, Low quality video, Mid-level representation, Texture features, BSIF

1. INTRODUCTION

Action recognition [1, 2, 3, 4, 5] is becoming increasingly important today due to its wide range of application domains, such as video surveillance, video indexing and search, and human-computer interaction. However, action recognition in real-world scenarios remains challenging, particularly with respect to video quality [6, 7, 8]; typical problems include low resolution and frame rates, compression artifacts, background clutter, camera ego-motion and jitter. Despite advances in video technology, there is still an undeniable need for efficient processing, storage and transmission. As such, it is crucial to deal with the problem of low video quality by designing more robust approaches to action recognition.
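To make the feature-bank idea concrete, the following is a minimal NumPy sketch of the aggregation described above: local shape, motion and texture descriptors are each quantized against a (pre-learned) codebook into a bag-of-words histogram, and the three mid-level histograms are concatenated into one vector. The function names, codebook sizes and random toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bow_encode(descriptors, codebook):
    """Quantize local descriptors against a codebook and return a
    normalized bag-of-words histogram (one mid-level representation)."""
    # Distance from every descriptor to every codeword (via broadcasting)
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)                     # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / (hist.sum() + 1e-8)                # L1-normalize

def feature_bank(shape_desc, motion_desc, texture_desc, codebooks):
    """Aggregate the three mid-level histograms into a single vector."""
    parts = [bow_encode(d, c)
             for d, c in zip((shape_desc, motion_desc, texture_desc), codebooks)]
    return np.concatenate(parts)

# Toy example: random descriptors stand in for shape (e.g. HOG),
# motion (e.g. HOF) and texture (e.g. BSIF-based) local features.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((16, 8)) for _ in range(3)]
descs = [rng.standard_normal((50, 8)) for _ in range(3)]
vec = feature_bank(*descs, codebooks)
print(vec.shape)  # (48,) -- three concatenated 16-bin histograms
```

In practice the codebooks would be learned (e.g. by k-means over training descriptors), but the aggregation step itself is this simple concatenation of per-channel mid-level encodings.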
In recent years, various methods have been developed to recognize human actions from video. Currently, among handcrafted methods, shape and motion features are the most widely used by the action recognition community. The extraction of these features consists of two essential steps: a detection step, where important points or salient regions are extracted from the video, and a description step, which then describes the patterns within the extracted regions. Typical detectors include space-time interest points [9], cuboids [10], dense sampling [1] and dense trajectories [11]. The HOG and HOF features [1, 9] appear most prominently in recent state-of-the-art approaches owing to their effectiveness in characterizing the dynamic and structural properties of actions. However, their reliance on localized feature regions may render them ineffective when discriminating between actions in low quality video [7].

Fig. 1. Low quality videos are generally affected by poor resolution, low sampling rate, motion blur and compression artifacts. Shown are sample video frames from the KTH (downsampled version, top row) and HMDB51 (with 'bad' quality label, bottom row) datasets.

The use of textural features is less common in action recognition; proposed representations include LBP-TOP [12] and Extended LBP-TOP [13]. These methods use the notion of three orthogonal planes (TOP) to extend static image-based textures to spatio-temporal dynamic textures. More recently, Kannala and Rahtu proposed binarized statistical image features (BSIF) [14], which have shown tremendous potential compared to their predecessors. While these methods achieve effective results across different action datasets, our recent work [7] has shown that local STIP features [9] become increasingly ineffective as video quality deteriorates spatially and temporally. It was shown that this problem can be alleviated by introducing complementary robust global textural features. Moreover, the holistic nature of