Journal of Signal Processing Systems
https://doi.org/10.1007/s11265-018-1363-x
Deep Packet Flow: Action Recognition via Multiresolution Deep
Wavelet Packet of Local Dense Optical Flows
Novanto Yudistira¹ · Takio Kurita¹
Received: 13 June 2016 / Revised: 21 January 2018 / Accepted: 28 March 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract
Action recognition with dynamic actors and scenes has been a major research topic. Recently, spatiotemporal features
such as optical flow have been used to define motion representations over a sequence of time. However, to increase
accuracy, deep decomposition is necessary to enrich the information for actions that vary in location or time due to
spatiotemporal dynamics. To this end, we propose an algorithm whose features consist of vectors obtained by applying
multiresolution analysis of motion using the Haar Wavelet Packet (HWP) over time. Its computational efficiency and
robustness have made HWP popular in texture analysis, but its applicability to motion analysis is yet to be explored.
To extract the representation, the sequence of each bin of the Histogram of Flow (HOF) is treated as a signal channel.
Deep decomposition is then applied to many levels using wavelet packet decomposition, which we call Packet Flow. It
allows us to represent the motions of an action at various speeds and ranges, focusing not only on the HOF within one
frame or one cuboid but also on the temporal sequence. HWP, however, is translation covariant, which hurts performance
because actions occur at arbitrary times and at various sampling locations. To gain translation invariance, we pool the
respective coefficients of the decomposition at each level. We find that, with proper packet selection, the method gives
comparable results on the KTH action and Hollywood datasets with a train-test division and without localization. Even
though the spatiotemporal cuboids are not sampled as densely as in the baseline method, we achieve lower complexity and
comparable performance on datasets burdened by camera motion, such as UCF Sports, on which motion features such as HOF
do not perform well.
Keywords Action recognition · Wavelet packet analysis · Temporal dynamics · Dense optical flows
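The pipeline outlined in the abstract (HOF bins treated as signal channels, Haar wavelet packet decomposition over time, pooling of the coefficients at each level) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the number of levels, and the choice of max-absolute-value pooling are assumptions made for illustration; the abstract only says that coefficients are pooled, and it does not specify the packet selection step, which is omitted here.

```python
import numpy as np

def haar_step(x):
    """One orthonormal Haar analysis step along the last (time) axis.

    Returns (approximation, detail), each half as long as x.
    """
    even, odd = x[..., 0::2], x[..., 1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_wavelet_packet(x, levels):
    """Full wavelet packet tree: both outputs are split again at every level.

    x has shape (bins, T) with T divisible by 2**levels.
    Returns a list over levels; entry l-1 holds the 2**l packets of level l.
    """
    tree, nodes = [], [np.asarray(x, dtype=float)]
    for _ in range(levels):
        nodes = [half for node in nodes for half in haar_step(node)]
        tree.append(nodes)
    return tree

def packet_flow_descriptor(hof_sequence, levels=3):
    """Pool every packet over time and concatenate the results.

    hof_sequence has shape (T, bins): one HOF histogram per time step.
    Max-abs pooling is an assumed choice, not taken from the paper.
    """
    channels = np.asarray(hof_sequence, dtype=float).T  # (bins, T)
    tree = haar_wavelet_packet(channels, levels)
    pooled = [np.abs(p).max(axis=-1) for level in tree for p in level]
    return np.concatenate(pooled)

# Hypothetical input: 16 time steps of a 9-bin HOF.
hof = np.random.default_rng(0).random((16, 9))
descriptor = packet_flow_descriptor(hof, levels=3)
print(descriptor.shape)
```

In this sketch, pooling over time makes the descriptor unchanged under circular shifts of the sequence by multiples of 2**levels time steps; for other shifts the invariance is only approximate.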
1 Introduction
Intelligent vision systems [19], and action recognition in particular, are a growing topic in
computer vision and pattern recognition. Action recognition has been gaining popularity since the
work of Schuldt et al., which also provides a well-known dataset [11]. Correspondingly, many
real-world recognition applications exploit human actions, such as camera surveillance, video
classification, sports analysis, and human-computer interaction, and these applications become
more demanding as hardware becomes more sophisticated. Action recognition remains a challenging
problem because humans can perform an action in many ways and cameras can capture the subject in
various manners. For instance, in terms of appearance, people wear clothes of many kinds and
colors. Occlusion is another problem, which sometimes turns real motion into false or less
informative motion. From the camera aspect, the human subject is captured at various scales
depending on distance. Moreover, the camera can be static or moving, which is also an emerging
problem that remains wide open. From the human aspect, an action is dynamic, varying in speed,
background, clothes, and illumination. To tackle this problem, handcrafted HOF alone cannot
describe the variability of dense optical flows. Thus, it is reasonable to extend it into a
sequence of HOF and apply some sort of decomposition to discover general and distinctive
patterns. High-level information is required to give semantic meaning to classes. These issues
have led researchers to propose many feature representations to discriminate the types of
actions performed by humans.
Novanto Yudistira
cbasemaster@gmail.com
1 Hiroshima University, Higashi Hiroshima, Japan
Recognition research should concentrate more on feature representations, especially for action
recognition. Many previous works have proposed various features, whether spatiotemporal,
template-based,