Journal of Signal Processing Systems
https://doi.org/10.1007/s11265-018-1363-x

Deep Packet Flow: Action Recognition via Multiresolution Deep Wavelet Packet of Local Dense Optical Flows

Novanto Yudistira 1 · Takio Kurita 1

Received: 13 June 2016 / Revised: 21 January 2018 / Accepted: 28 March 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract
Action recognition with dynamic actors and scenes has been a major research topic. Recently, spatiotemporal features such as optical flows have been used to represent motion over a sequence of time. However, to increase accuracy, a deep decomposition is necessary to enrich the information available for actions that vary in location or time because of spatiotemporal dynamics. To this end, we propose an algorithm whose features are vectors obtained by applying multiresolution analysis of motion using the Haar Wavelet Packet (HWP) over time. Its computational efficiency and robustness have made HWP popular in texture analysis, but its applicability to motion analysis is yet to be explored. To extract the representation, the sequence of each bin of the Histogram of Flow (HOF) is treated as a signal channel. Deep decomposition is then applied to many levels using wavelet packet decomposition, which we call Packet Flow. It allows us to represent an action's motions at various speeds and ranges, focusing not only on the HOF within one frame or one cuboid but also on the temporal sequence. HWP, however, is translation covariant, which hurts performance because actions occur at arbitrary times and at various sampling locations. To gain translation invariance, we pool the respective coefficients of the decomposition at each level. We find that, with proper packet selection, the method gives comparable results on the KTH action and Hollywood datasets with a train-test division and without localization.
Even though our spatiotemporal cuboids are not sampled as densely as in the baseline method, we achieve lower complexity and comparable performance on datasets burdened by camera motion, such as UCF Sports, on which motion features such as HOF do not perform well.

Keywords Action recognition · Wavelet packet analysis · Temporal dynamics · Dense optical flows

Novanto Yudistira
cbasemaster@gmail.com
1 Hiroshima University, Higashi Hiroshima, Japan

1 Introduction

Intelligent vision systems [19], and action recognition in particular, are a growing topic in computer vision and pattern recognition. Action recognition has gained popularity since the work of Schuldt et al., which also provides a well-known dataset [11]. Correspondingly, many real-world recognition applications exploit human actions, such as surveillance cameras, video classification, sports analysis, and human-computer interaction, and these applications become more demanding as hardware becomes more sophisticated. This makes action recognition a challenging problem, since humans can perform an action in many ways and cameras can capture it in various manners. For instance, in terms of appearance, people wear clothes of many kinds and colors. Occlusion is another problem that sometimes distorts real motion into false or less informative motion. From the camera's perspective, humans are captured at various scales depending on distance. Moreover, the camera can be static or moving, which is a further problem that remains wide open. From the human perspective, an action is dynamic, varying in speed, background, clothes, and illumination. To tackle this problem, the handcrafted HOF alone cannot describe the variability of dense optical flows. It is therefore reasonable to extend it into a sequence of HOFs and to apply some form of decomposition to discover general and distinctive patterns. A high-level representation is required to give semantic meaning to classes.
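The idea of forming a sequence of HOF values and decomposing it can be illustrated with a minimal sketch. It assumes NumPy, a full Haar wavelet packet tree, and max-absolute pooling per sub-band; the function names, the toy input, and the pooling choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def haar_step(x):
    """One Haar analysis step: return (approximation, detail) at half length."""
    x = np.asarray(x, dtype=float)
    if x.size % 2:  # pad odd-length signals by repeating the last sample
        x = np.append(x, x[-1])
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_wavelet_packet(signal, levels):
    """Full packet tree: tree[l] holds the 2**l sub-bands of level l."""
    tree = [[np.asarray(signal, dtype=float)]]
    for _ in range(levels):
        next_level = []
        for node in tree[-1]:
            # Unlike the plain wavelet transform, BOTH outputs are split again.
            approx, detail = haar_step(node)
            next_level.extend([approx, detail])
        tree.append(next_level)
    return tree

def pooled_packet_features(signal, levels):
    """Pool each sub-band to one number (max |coefficient|, an assumed choice)."""
    tree = haar_wavelet_packet(signal, levels)
    return np.array([np.max(np.abs(node)) for level in tree[1:] for node in level])

# Hypothetical input: the value of one HOF bin tracked over 8 frames.
hof_bin = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
features = pooled_packet_features(hof_bin, levels=2)

# Moving the motion burst by 4 frames (a multiple of 2**levels) leaves
# the pooled vector unchanged in this toy case.
shifted = pooled_packet_features(np.roll(hof_bin, 4), levels=2)
```

In this sketch, `features` holds six values for two levels: two sub-bands at level 1 and four at level 2. Pooling leaves the vector unchanged under shifts aligned with the decomposition grid, as in the rolled example; other shifts are only tolerated approximately.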
These issues have led researchers to propose many feature representations for discriminating the types of actions performed by humans.

Recognition research, especially for action recognition, should concentrate more on feature representations. Many previous works have proposed various features, whether spatiotemporal, template-based,