IEEE SIGNAL PROCESSING LETTERS, VOL. 26, NO. 8, AUGUST 2019 1187
Three-Stream Network With Bidirectional
Self-Attention for Action Recognition in Extreme
Low Resolution Videos
Didik Purwanto, Rizard Renanda Adhi Pramono, Yie-Tarng Chen, and Wen-Hsien Fang
Abstract—This letter presents a novel three-stream network for
action recognition in extreme low resolution (LR) videos. In con-
trast to the existing networks, the new network uses the trajectory-
spatial network, which is robust against visual distortion, instead of
the pose information to complement the two-stream network. Also,
the new three-stream network is combined with the inflated 3D
ConvNet (I3D) model pre-trained on Kinetics to produce more dis-
criminative spatio-temporal features in blurred LR videos. More-
over, a bidirectional self-attention network is aggregated with the
three-stream network to further capture the temporal dependen-
cies among the spatio-temporal features. A new fusion strategy
is devised as well to integrate the information from the three dif-
ferent modalities. Simulations show that the new architecture out-
performs the main state-of-the-art extreme LR action recognition
methods on the HMDB-51 and IXMAS datasets.
Index Terms—Action recognition, low resolution videos, self-
attention, trajectory-spatial network, deep learning.
I. INTRODUCTION
Low resolution (LR) videos arise in a variety of disciplines
such as video surveillance [1]–[3], action recognition
[4]–[8], and face detection [9]–[11]. However, LR videos in
general contain less visual information and are susceptible to
noise. It is thus challenging to develop a robust descriptor for
action recognition in LR videos.
A myriad of algorithms have been proposed for action recog-
nition in extreme LR videos. Ryoo et al. [1] introduced inverse
super resolution, which takes advantage of the existing high res-
olution videos in training by learning different types of sub-pixel
transformations. Chen et al. [12] proposed a semi-coupled net-
work, which is based on filter sharing to benefit from high res-
olution training. Rahman et al. [13] combined the handcrafted
and the deep learned features to improve performance. Ryoo
et al. [14] used a two-stream multi-siamese convolutional neu-
ral network (CNN) to learn shared embedding spaces that map
Manuscript received February 27, 2019; revised May 23, 2019; accepted June
6, 2019. Date of publication June 19, 2019; date of current version July 2,
2019. This work was supported by the Ministry of Science and Technology,
China under contracts MOST 107-2221-E-011-124 and MOST 107-2221-E-
011-078-MY2. The associate editor coordinating the review of this manuscript
and approving it for publication was Dr. Yap-Peng Tan. (Corresponding author:
Didik Purwanto.)
The authors are with the Department of Electronic and Computer Engineer-
ing, National Taiwan University of Science and Technology, Taipei 10607, Tai-
wan (e-mail: d10602806@mail.ntust.edu.tw; d10702801@mail.ntust.edu.tw;
ytchen@mail.ntust.edu.tw; whf@mail.ntust.edu.tw).
Digital Object Identifier 10.1109/LSP.2019.2923918
LR videos with the same content to the same location. Also,
Yu et al. [15] proposed a pseudo tensor low rank regularization
to recover inherent robust components of an input video. Xu
et al. [16] proposed a fully-coupled network architecture to gen-
erate robust video representations by incorporating 3D convolu-
tional and recurrent neural networks to better capture motion
information. However,
the aforementioned methods [1], [12]–[16] did not fully exploit
the temporal relationships among frames, which are beneficial
for action recognition when there is a substantial loss of
spatial information. Some recent approaches based on 3D skele-
tons [17], [18] or differential images [19] were also considered
for action recognition, but they were not devised for extreme
LR videos.
In this letter, we present a novel three-stream network for
action recognition in extreme LR videos. To resolve the visual
degradation, in contrast to the existing methods [20], [21], our
three-stream network encodes temporal trajectory dynamics
using trajectory patterns in the Hue, Saturation, Value (HSV)
color space, which are more robust against visual distortion than
pose information, to complement the well-known two-stream
network. Also, the new network is combined with the inflated
3D ConvNet (I3D) model [22] pre-trained on Kinetics to produce
more discriminative spatio-temporal features in blurred LR
videos. Moreover, a bidirectional self-attention network is
aggregated with the three-stream network to further capture the
temporal dependencies among the spatio-temporal features. A
new fusion strategy is devised as
well to integrate the information from the three different modal-
ities. Simulations show that the new approach provides superior
performance over the state-of-the-art extreme LR action recog-
nition methods on the HMDB-51 and IXMAS datasets.
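To make the trajectory-spatial encoding concrete, a common way to render motion as an HSV image maps motion direction to hue and motion magnitude to value; the sketch below uses this standard flow-colorization scheme as a plausible stand-in for the letter's exact trajectory-pattern encoding, which is detailed later. The displacement fields `dx`, `dy` and the normalization are illustrative assumptions.

```python
import numpy as np

def trajectory_to_hsv(dx, dy):
    """Map per-pixel trajectory displacements to an HSV image.

    Hue encodes motion direction and value encodes normalized
    magnitude. This is the standard flow-colorization scheme,
    used here only as an illustrative stand-in for the letter's
    trajectory-spatial encoding.
    """
    mag = np.hypot(dx, dy)                 # displacement magnitude
    ang = np.arctan2(dy, dx)               # direction in [-pi, pi]
    h = (ang + np.pi) / (2 * np.pi)        # direction -> hue in [0, 1]
    s = np.ones_like(h)                    # full saturation
    v = mag / (mag.max() + 1e-8)           # magnitude -> brightness
    return np.stack([h, s, v], axis=-1)    # (H, W, 3) HSV image

dx = np.ones((4, 4))                       # uniform rightward motion
dy = np.zeros((4, 4))
hsv = trajectory_to_hsv(dx, dy)
print(hsv.shape)  # (4, 4, 3)
```

Because hue and value vary smoothly with direction and magnitude, such images degrade more gracefully under blur and downsampling than fine-grained cues such as pose keypoints.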
The contributions of this letter can be summarized as fol-
lows: i) we employ the trajectory-spatial information to capture
the fine-grained motion in extreme LR videos, which can com-
plement the conventional two-stream network; ii) we propose
a new architecture, which combines the three-stream network
with a bidirectional self-attention network based on a new pair-
wise similarity function to leverage the temporal dependency
information; iii) we design a new fusion strategy to effectively
aggregate the outputs from the three different modalities; iv)
we demonstrate that the I3D model pre-trained on a large-scale
video dataset such as Kinetics can benefit action classification
in extreme LR videos.
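As a minimal sketch of the bidirectional self-attention idea in contribution ii), the NumPy code below applies a forward (past-and-self) and a backward (future-and-self) masked attention pass over a sequence of per-segment features and sums the two directions. Scaled dot-product similarity is used as a stand-in for the letter's new pairwise similarity function, and the shapes and fusion by summation are illustrative assumptions, not the exact formulation given later.

```python
import numpy as np

def masked_attention(x, mask):
    # Pairwise similarity: scaled dot product, a stand-in for the
    # letter's new pairwise similarity function.
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = np.where(mask, scores, -np.inf)   # block disallowed positions
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # row-wise softmax
    return w @ x                               # attention-weighted features

def bidirectional_self_attention(x):
    """x: (T, D) sequence of per-segment spatio-temporal features."""
    T = x.shape[0]
    fwd = np.tril(np.ones((T, T), dtype=bool))  # attend to past + self
    bwd = np.triu(np.ones((T, T), dtype=bool))  # attend to future + self
    return masked_attention(x, fwd) + masked_attention(x, bwd)

feats = np.random.default_rng(0).normal(size=(8, 16))  # 8 segments, 16-D
out = bidirectional_self_attention(feats)
print(out.shape)  # (8, 16)
```

Running the forward and backward passes separately lets each position aggregate evidence from both temporal directions, which is the kind of temporal dependency the three-stream features alone do not capture.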