IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Revisiting Video Saliency Prediction in the Deep Learning Era

Wenguan Wang, Member, IEEE, Jianbing Shen, Senior Member, IEEE, Jianwen Xie, Ming-Ming Cheng, Haibin Ling, and Ali Borji

Abstract—Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, considerably less effort has been devoted to understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during dynamic scene free-viewing, a long-standing need in this field. DHF1K consists of 1K high-quality, elaborately selected video sequences annotated by 17 observers using an eye tracker. The videos span a wide range of scenes, motions, object types, and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), that augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF Sports. An attribute-based analysis of previous saliency models and cross-dataset generalization are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and has a fast processing speed (40 fps on a single GPU).
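As a rough illustration of the mechanism described above (not part of the paper itself), the following NumPy sketch shows one step of an attention-augmented LSTM: a static-saliency attention map re-weights per-frame CNN features before a standard LSTM update. All names, shapes, and the residual `(1 + att)` gating are assumptions for exposition; ACLNet itself operates on convolutional feature maps and supervises the attention with static fixation data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attentive_lstm_step(feat, h, c, W, U, b, att_w):
    """One hypothetical ACLNet-style step (vectorized sketch).

    feat : (d,)  per-frame CNN feature vector
    h, c : (d,)  LSTM hidden and cell state
    W, U : (4d, d) input/recurrent weights; b : (4d,) bias
    att_w: (d, d) weights of the (assumed) static attention module
    """
    # Static-saliency attention in [0, 1]; in the paper this module is
    # supervised with large-scale static fixation data.
    att = sigmoid(feat @ att_w)
    # Residual attention gating (an assumption): attended features still
    # retain the original signal.
    x = feat * (1.0 + att)
    # Standard LSTM gates computed on the attended features.
    z = W @ x + U @ h + b
    d = h.size
    i = sigmoid(z[:d])          # input gate
    f = sigmoid(z[d:2 * d])     # forget gate
    o = sigmoid(z[2 * d:3 * d]) # output gate
    g = np.tanh(z[3 * d:])      # candidate cell update
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c
```

Iterating this step over successive frame features would yield the temporal saliency representation that the LSTM is responsible for; the attention module alone already carries the static-saliency prior.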
Our code and all the results are available at https://github.com/wenguanwang/DHF1K.

Index Terms—Video saliency, dynamic visual attention, benchmark, deep learning.

1 INTRODUCTION

The human visual system (HVS) has an astonishing ability to quickly select and concentrate on important regions in the visual field. This cognitive process allows humans to selectively process a vast amount of visual information and attend to important parts of a crowded scene while ignoring irrelevant information. This selective mechanism, known as visual attention, allows humans to interpret complex scenes in real time. Over the last few decades, several computational models have been proposed to imitate the attentional mechanisms of the HVS during static scene viewing. Significant advances have been achieved recently thanks to the rapid spread of deep learning techniques and the availability of large-scale static gaze datasets (e.g., SALICON [2]). In stark contrast, predicting observers' fixations during dynamic scene free-viewing has been under-explored. This task, referred to as dynamic fixation prediction or video saliency detection, is essential for understanding human attention behaviors and has various practical real-world applications (e.g., video captioning [3], compression [4], question answering [5], object segmentation [6], action recognition [7], etc.). It is thus highly desirable to have a standard, high-quality benchmark composed of diverse and representative video stimuli. Existing datasets are severely limited in their coverage and scalability, and only include special scenarios such as limited human activities. They lack generic, representative, and diverse instances in unconstrained, task-independent scenarios. Consequently, existing datasets fail to offer a rich set of fixations for learning video saliency and assessing models. Moreover, they do not provide an evaluation server with a standalone held-out test set to avoid potential dataset over-fitting.

While saliency benchmarks (e.g., MIT300 [8] and SALICON [2]) have been very instrumental in advancing the static saliency field [9], such standard, widespread benchmarks are missing for video saliency modeling. We believe such benchmarks are highly desired to drive the field forward. To this end, we propose a new benchmark, named DHF1K (Dynamic Human Fixation 1K), with a public server for reporting evaluation results on a preserved test set. DHF1K comes with a dataset that is unique in terms of generality, diversity, and difficulty. It has 1K videos with over 600K frames and per-frame fixation annotations from 17 observers. The sequences have been carefully collected to cover diverse scenes, motion patterns, object categories, and activities. DHF1K is accompanied by a comprehensive evaluation of 23 state-of-the-art approaches [10]–[31]. Moreover, each video is annotated with a main category label (e.g., daily activities, animals) and rich attributes (e.g., camera/content movement, scene lighting, presence of humans), which facilitate deeper understanding of gaze guid-

W. Wang and J. Shen are with Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, and also with Inception Institute of Artificial Intelligence, UAE. (Email: wenguanwang.ai@gmail.com, shenjianbing@bit.edu.cn)
J. Xie is with Hikvision Research Institute, USA. (Email: Jianwen.Xie@hikvision.com)
M.-M. Cheng is with the College of Computer Science, Nankai University. (Email: cmm@nankai.edu.cn)
H. Ling is with the Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA. (Email: hbling@temple.edu)
A. Borji is with MarkableAI. (Email: aliborji@gmail.com)
A preliminary version of this work has appeared in CVPR 2018 [1].
Corresponding author: Jianbing Shen