ONLINE HUMAN ACTION LOCALISATION BASED ON APPEARANCE AND MOTION CUES

Saha S., Cuzzolin F. - Oxford Brookes University; Sapienza M. - University of Oxford
{suman.saha-2014, fabio.cuzzolin}@brookes.ac.uk, michael.sapienza@eng.ox.ac.uk

Abstract

We investigate the problem of online action localisation in videos. Our model uses appearance and motion cues to generate region proposals from streaming video frames. Deep feature representations have recently outperformed handcrafted features in object classification; driven by this progress, we build our system on deep CNN features. We propose an online incremental learning framework which initially learns from a burst of streaming video frames and then iteratively updates the learner by solving a set of linear SVMs (one-vs-rest) with a batch stochastic gradient descent (SGD) algorithm and hard example mining.

State-of-the-art

• Deep learning techniques [1] have recently outperformed hand-crafted feature representations in action classification [3] and detection [2].
• However, a robust online action detection system has yet to be addressed by the vision community.
• It is therefore worthwhile to investigate state-of-the-art deep learning approaches for online action detection.

References

[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, in NIPS, 2012.
[2] G. Gkioxari, J. Malik, Finding Action Tubes, in CVPR, 2015.
[3] K. Simonyan, A. Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, in CoRR, 2014.
[4] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part based models, PAMI, 2010.
[5] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in CVPR, 2014.
[6] T. Brox, A. Bruhn, N. Papenberg, J. Weickert, High accuracy optical flow estimation based on a theory for warping, in Proc. ECCV, 2004.
[7] C. Xu, C. Xiong, J. Corso, Streaming Hierarchical Video Segmentation, in ECCV, 2012.

Our approach: Region Proposals Generator

Method

1. Extract space-time region proposals R_i from a burst of video frames F_j using the SGBH (streaming graph-based hierarchical) video segmentation algorithm [7].
2. Prune the region proposals (cf. Step 1) using motion saliency scores S_m obtained from dense optical flow fields computed over F_j [6].
3. Rank the motion-salient region proposals (cf. Step 2) by their intersection-over-union (IoU) scores with respect to the ground-truth annotations.
4. Obtain an image patch descriptor for each ranked region proposal using a pre-trained Convolutional Neural Network (CNN).
5. Train an online incremental learning algorithm on the CNN features (cf. Step 4) for action classification and detection.

Motion saliency score: $S_m = \frac{\sum_{i \in R} f_m(i)}{\sum_{j \in I} f_m(j)}$, where $f_m$ is the normalised optical flow magnitude, $R$ a region proposal and $I$ the whole image.

IoU score: $a_o = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}$, where $B_p$ is a proposed bounding box and $B_{gt}$ a ground-truth annotation. (Both scores are sketched in code below.)

[Figure: (a) region proposals; (b) dense optical flow fields; (c) region proposals pruned using optical flow; (d) region proposals finally selected by our proposed ranking algorithm.]
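To make Steps 2 and 3 concrete, here is a minimal Python sketch of the motion saliency score S_m and the IoU-based ranking. It assumes boxes in (x, y, w, h) pixel format and substitutes OpenCV's Farneback optical flow for the Brox et al. method [6] used in our pipeline; the function names and the saliency threshold are illustrative, not part of the original system.

    import numpy as np
    import cv2

    def motion_saliency(flow, box):
        """Motion saliency S_m: the fraction of the frame's total
        (normalised) optical-flow magnitude falling inside the box."""
        mag = np.linalg.norm(flow, axis=2)      # per-pixel flow magnitude
        mag = mag / (mag.max() + 1e-8)          # normalise to [0, 1]
        x, y, w, h = box
        return mag[y:y + h, x:x + w].sum() / (mag.sum() + 1e-8)

    def iou(box_p, box_gt):
        """Intersection-over-union a_o between two (x, y, w, h) boxes."""
        xa, ya, wa, ha = box_p
        xb, yb, wb, hb = box_gt
        iw = max(0, min(xa + wa, xb + wb) - max(xa, xb))
        ih = max(0, min(ya + ha, yb + hb) - max(ya, yb))
        inter = iw * ih
        union = wa * ha + wb * hb - inter
        return inter / union if union > 0 else 0.0

    def prune_and_rank(prev_gray, gray, proposals, gt_box, sal_thresh=0.05):
        """Steps 2-3 for one pair of consecutive greyscale frames: drop
        low-saliency proposals, then rank the survivors by IoU with the
        ground-truth box (available at training time)."""
        # Farneback flow is a stand-in here; the poster uses Brox et al. [6].
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        kept = [b for b in proposals if motion_saliency(flow, b) > sal_thresh]
        return sorted(kept, key=lambda b: iou(b, gt_box), reverse=True)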
Actionness ranking

[Figure: estimated actionness regions (blue bounding boxes) against ground-truth annotations (green bounding boxes) on videos 9, 11, 56, 62 and 167, covering multiple simultaneous actions and actions with low movement.]

The figure above shows a strong correlation between the motion saliency measure and the IoU overlap, which is necessary to ensure good performance at test time. Notice in rows 2 and 4 that our estimated actionness regions (blue bounding boxes) overlap strongly with the ground-truth annotations (green bounding boxes). Videos containing actions such as "try enter room unsuccessfully" (videos 56, 167), "leave baggage unattended" (video 62) and "put take obj into from box/desk" (video 167) exhibit relatively large movements and thus have larger mean IoU scores, between 0.44 and 0.53. In contrast, actions such as "typing on keyboard", "telephone conversation" and "discussion" (video 11) involve less movement and have lower IoU scores, between 0.3 and 0.41.

Online learner

• Given CNN feature vectors $x_i \in \mathbb{R}^n$ for region proposals $R_i$, a set of linear SVMs (one-vs-rest) is used to assign class labels $y_i$ to the $R_i$.
• In a classical SVM setting, the following objective function is minimised:

  $O_{\mathcal{D}}(\hat{w}) = \frac{1}{2}\|\hat{w}\|^2 + C \sum_{i=1}^{D} \max(0, 1 - y_i \hat{w}^T \hat{x}_i)$   (1)

  where: dataset $\mathcal{D} = \{(x_1, y_1), \dots, (x_D, y_D)\}$; vector of parameters $\hat{w} = (w, b)$; regularisation parameter $C$; total number of examples $D$; $y \in \{-1, +1\}$; $\hat{x} = (x, 1)$, augmented to include a bias multiplier.
• In our case the data are streamed in time; we therefore use a batch variant of SGD which iteratively updates $\hat{w}$ by taking a step in the negative direction of the gradient with respect to a randomised example set $E_t \subseteq \mathcal{D}$.
• Given inputs $\hat{w}_t$, $\mathcal{H}_t$ and $E_t$, a single step towards the minimum of the objective function in Eq. 1 is (see the sketch below):

  $\hat{w}_{t+1} := \hat{w}_t - \alpha_t \big( \hat{w}_t + C \sum_{(\hat{x}_i, y_i) \in \mathcal{H}_t} h(\hat{w}_t, \hat{x}_i, y_i) \big)$   (2)

  where $\alpha_t$ is the learning step size at time $t$ and $\mathcal{H}$ is a cache of hard examples [4], updated as $\mathcal{H} := \mathcal{H} \cup \mathrm{sample}(E_t, \text{batch-size})$.
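The following is a minimal sketch of the batch SGD update in Eq. 2 with a hard-example cache in the spirit of [4], where $h$ is taken to be the hinge-loss subgradient $-y_i \hat{x}_i$ for margin-violating examples. The cache policy, batch format and hyper-parameter defaults are illustrative assumptions, not the authors' exact implementation.

    import numpy as np

    def sgd_step(w, cache, batch, C=1.0, lr=1e-3, max_cache=512):
        """One batch-SGD step on the hinge objective (Eq. 1/2).
        `batch` is a list of (x, y) pairs with x an augmented feature
        vector (bias term appended) and y in {-1, +1}."""
        # Mine hard examples: those violating the margin y * w.x < 1.
        cache.extend((x, y) for x, y in batch if y * w.dot(x) < 1)
        del cache[:-max_cache]          # bound the cache size (keep newest)

        # Subgradient: w from the regulariser ||w||^2 / 2, plus the
        # hinge terms -C * y * x for each still-violating cached example.
        grad = w.copy()
        for x, y in cache:
            if y * w.dot(x) < 1:
                grad -= C * y * x
        return w - lr * grad

In the streaming setting, one such step would be taken per incoming example set $E_t$, so each one-vs-rest SVM improves incrementally as frames arrive.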
Results and Conclusion

[Figure (a): mean IoU score per action class (0-9). Figure (b): number of region proposals per video, SGBH vs. motion saliency region proposals.]

Experimental setup: the challenging LIRIS HARL human activity dataset, with 10 complex human action classes; a desktop with an Intel Core i5-3570 CPU @ 3.40GHz x 4 and 32 GB RAM.

Figure (a) shows that actions (1 to 7) involving relatively large movements have larger IoU scores, while actions (0, 8 and 9) involving relatively little movement have smaller IoU scores; a strong correlation between the notion of actionness (or the motion saliency measure) and IoU overlap can thus be observed. The IoU score is averaged over all the class-specific training videos. Figure (b) shows a dramatic reduction of the image search space, with 45 times fewer region proposals after pruning.

Notice in Table 1 that actions 3, 5 and 6 appear in both the top-10 and lowest-10 IoU score lists. The reason for this discrepancy between IoU scores of the same action classes lies in the SGBH region proposals. SGBH hierarchically groups regions based on only a single region-merging criterion: the similarity between two regions in a video is measured by the Chi-squared distance between the colour histograms of those regions in Lab colour space. Because of this single merging criterion, the space-time region proposals do not overlap consistently with the ground-truth annotations, and the region proposals drift over time. The resulting SGBH bounding boxes therefore yield low IoU scores, degrading the overall performance of the detection system. Table 2 lists the 10 LIRIS HARL action classes and their label ids. (*) IoU scores are averaged over each video clip.

Conclusion and future work:

• Our motion saliency algorithm robustly detects multiple actions simultaneously.
• To improve the IoU scores, we will combine our motion saliency method with segmentation algorithms that consider a range of diversified region-merging criteria, such as "Selective Search".
• To make our motion saliency algorithm more robust against multiple actions, we will use techniques such as "non-maximum suppression" [5].
• We plan to integrate motion and appearance features by incorporating a separate CNN to encode action dynamics from multiple consecutive video frames.
• At test time we link multiple compound hypotheses (region proposals) over individual action tubes according to their class-specific SVM scores (a simplified sketch follows below).
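As a rough illustration of that test-time linking step, the sketch below greedily grows an action tube by picking, in each frame, the proposal that maximises the sum of its class-specific SVM score and its IoU overlap with the previously linked box (reusing the iou helper from the earlier sketch). The scoring function and the greedy strategy are our assumptions; the authors' exact linking procedure may differ.

    def link_tube(frames):
        """Greedily link per-frame proposals into one action tube.
        `frames` is a list (over time) of lists of (box, svm_score)
        pairs for a single action class."""
        tube, prev_box = [], None
        for detections in frames:
            if not detections:          # no proposal survived in this frame
                continue
            # Prefer high-scoring boxes that also overlap the previous one.
            best = max(detections,
                       key=lambda d: d[1] + (iou(prev_box, d[0])
                                             if prev_box else 0.0))
            tube.append(best)
            prev_box = best[0]
        return tube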