IEEE SIGNAL PROCESSING LETTERS, VOL. 26, NO. 8, AUGUST 2019 1187
Three-Stream Network With Bidirectional
Self-Attention for Action Recognition in Extreme
Low Resolution Videos
Didik Purwanto, Rizard Renanda Adhi Pramono, Yie-Tarng Chen, and Wen-Hsien Fang
Abstract—This letter presents a novel three-stream network for
action recognition in extreme low resolution (LR) videos. In con-
trast to the existing networks, the new network uses the trajectory-
spatial network, which is robust against visual distortion, instead of
the pose information to complement the two-stream network. Also,
the new three-stream network is combined with the inflated 3D
ConvNet (I3D) model pre-trained on Kinetics to produce more dis-
criminative spatio-temporal features in blurred LR videos. More-
over, a bidirectional self-attention network is aggregated with the
three-stream network to further capture the temporal dependen-
cies among the spatio-temporal features. A new fusion strategy
is devised as well to integrate the information from the three dif-
ferent modalities. Simulations show that the new architecture out-
performs the main state-of-the-art extreme LR action recognition
methods on the HMDB-51 and IXMAS datasets.
Index Terms—Action recognition, low resolution videos, self-
attention, trajectory-spatial network, deep learning.
I. INTRODUCTION
Low resolution (LR) videos arise in a variety of disciplines
such as video surveillance [1]–[3], action recognition
[4]–[8], and face detection [9]–[11]. However, LR videos in
general contain less visual information and are susceptible to
noise. It is thus challenging to develop a robust descriptor for
action recognition in LR videos.
A myriad of algorithms have been proposed for action recog-
nition in extreme LR videos. Ryoo et al. [1] introduced inverse
super resolution, which takes advantage of the existing high res-
olution videos in training by learning different types of sub-pixel
transformations. Chen et al. [12] proposed a semi-coupled net-
work, which is based on filter sharing to benefit from high res-
olution training. Rahman et al. [13] combined the handcrafted
and the deep learned features to improve performance. Ryoo
et al. [14] used a two-stream multi-siamese convolutional neu-
ral network (CNN) to learn shared embedding spaces that map
Manuscript received February 27, 2019; revised May 23, 2019; accepted June
6, 2019. Date of publication June 19, 2019; date of current version July 2,
2019. This work was supported by the Ministry of Science and Technology,
China under contracts MOST 107-2221-E-011-124 and MOST 107-2221-E-
011-078-MY2. The associate editor coordinating the review of this manuscript
and approving it for publication was Dr. Yap-Peng Tan. (Corresponding author:
Didik Purwanto.)
The authors are with the Department of Electronic and Computer Engineer-
ing, National Taiwan University of Science and Technology, Taipei 10607, Tai-
wan (e-mail: d10602806@mail.ntust.edu.tw; d10702801@mail.ntust.edu.tw;
ytchen@mail.ntust.edu.tw; whf@mail.ntust.edu.tw).
Digital Object Identifier 10.1109/LSP.2019.2923918
LR videos with the same content to the same location. Also,
Yu et al. [15] proposed a pseudo tensor low rank regularization
to recover inherent robust components of an input video. Xu
et al. [16] proposed a fully-coupled network architecture to gen-
erate robust video representations by incorporating 3D convolu-
tional and recurrent neural networks to better capture motion
information. However,
the aforementioned methods [1], [12]–[16] did not fully exploit
the temporal relationships among frames, which are beneficial
for action recognition when there is a substantial loss of
spatial information. Some recent approaches based on 3D skele-
tons [17], [18] or differential images [19] were also considered
for action recognition, but they were not devised for extreme
LR videos.
In this letter, we present a novel three-stream network for
action recognition in extreme LR videos. To resolve the visual
degradation, in contrast to the existing methods [20], [21], our
three-stream network encodes temporal trajectory dynamics
using trajectory patterns in the Hue, Saturation, Value (HSV)
color space, which are more robust against visual distortion than
pose information, to complement the well-known two-stream
network. Also, the new network is combined with the inflated
3D ConvNet (I3D) model [22] pre-trained on Kinetics to produce
more discriminative spatio-temporal features in blurred LR
videos. Moreover, a bidirectional self-attention network is
aggregated with the three-stream network to further capture the
temporal dependencies among the spatio-temporal features. A
new fusion strategy is devised as
well to integrate the information from the three different modal-
ities. Simulations show that the new approach provides superior
performance over the state-of-the-art extreme LR action recog-
nition methods on the HMDB-51 and IXMAS datasets.
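To make the trajectory-spatial encoding concrete, a common way to render motion as an HSV image maps motion direction to hue and motion magnitude to value; the sketch below uses this standard flow-colorization scheme as a plausible stand-in for the letter's exact trajectory-pattern encoding, which is detailed later. The displacement fields `dx`, `dy` and the normalization are illustrative assumptions.

```python
import numpy as np

def trajectory_to_hsv(dx, dy):
    """Map per-pixel trajectory displacements to an HSV image.

    Hue encodes motion direction and value encodes normalized
    magnitude. This is the standard flow-colorization scheme,
    used here only as an illustrative stand-in for the letter's
    trajectory-spatial encoding.
    """
    mag = np.hypot(dx, dy)                 # displacement magnitude
    ang = np.arctan2(dy, dx)               # direction in [-pi, pi]
    h = (ang + np.pi) / (2 * np.pi)        # direction -> hue in [0, 1]
    s = np.ones_like(h)                    # full saturation
    v = mag / (mag.max() + 1e-8)           # magnitude -> brightness
    return np.stack([h, s, v], axis=-1)    # (H, W, 3) HSV image

dx = np.ones((4, 4))                       # uniform rightward motion
dy = np.zeros((4, 4))
hsv = trajectory_to_hsv(dx, dy)
print(hsv.shape)  # (4, 4, 3)
```

Because hue and value vary smoothly with direction and magnitude, such images degrade more gracefully under blur and downsampling than fine-grained cues such as pose keypoints.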
The contributions of this letter can be summarized as fol-
lows: i) we employ the trajectory-spatial information to capture
the fine-grained motion in extreme LR videos, which can com-
plement the conventional two-stream network; ii) we propose
a new architecture, which combines the three-stream network
with a bidirectional self-attention network based on a new pair-
wise similarity function to leverage the temporal dependency
information; iii) we design a new fusion strategy to effectively
aggregate the outputs from the three different modalities; iv)
we demonstrate that the I3D model pre-trained on a large-scale
video dataset such as Kinetics can benefit action classification
in extreme LR videos.
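As a minimal sketch of the bidirectional self-attention idea in contribution ii), the NumPy code below applies a forward (past-and-self) and a backward (future-and-self) masked attention pass over a sequence of per-segment features and sums the two directions. Scaled dot-product similarity is used as a stand-in for the letter's new pairwise similarity function, and the shapes and fusion by summation are illustrative assumptions, not the exact formulation given later.

```python
import numpy as np

def masked_attention(x, mask):
    # Pairwise similarity: scaled dot product, a stand-in for the
    # letter's new pairwise similarity function.
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = np.where(mask, scores, -np.inf)   # block disallowed positions
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # row-wise softmax
    return w @ x                               # attention-weighted features

def bidirectional_self_attention(x):
    """x: (T, D) sequence of per-segment spatio-temporal features."""
    T = x.shape[0]
    fwd = np.tril(np.ones((T, T), dtype=bool))  # attend to past + self
    bwd = np.triu(np.ones((T, T), dtype=bool))  # attend to future + self
    return masked_attention(x, fwd) + masked_attention(x, bwd)

feats = np.random.default_rng(0).normal(size=(8, 16))  # 8 segments, 16-D
out = bidirectional_self_attention(feats)
print(out.shape)  # (8, 16)
```

Running the forward and backward passes separately lets each position aggregate evidence from both temporal directions, which is the kind of temporal dependency the three-stream features alone do not capture.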