1057-7149 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2018.2885236, IEEE Transactions on Image Processing

Light Field Spatial Super-Resolution Using Deep Efficient Spatial-Angular Separable Convolution

Henry Wing Fung Yeung, Student Member, IEEE, Junhui Hou, Member, IEEE, Xiaoming Chen, Member, IEEE, Jie Chen, Member, IEEE, Zhibo Chen, Senior Member, IEEE, and Yuk Ying Chung, Member, IEEE

Abstract—Light field (LF) photography is an emerging paradigm for capturing more immersive representations of the real world. However, owing to the inherent trade-off between the angular and spatial dimensions, the spatial resolution of LF images captured by commercial micro-lens based LF cameras is significantly constrained. In this paper, we propose effective and efficient end-to-end convolutional neural network models for spatially super-resolving LF images. Specifically, the proposed models have an hourglass shape, which allows feature extraction to be performed at the low-resolution level to save both computational and memory costs. To make full use of the four-dimensional (4-D) structure information of LF data in both the spatial and angular domains, we propose to use 4-D convolution to characterize the relationship among pixels. Moreover, as an approximation of 4-D convolution, we also propose spatial-angular separable (SAS) convolutions for more computationally and memory-efficient extraction of spatial-angular joint features.
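To make the SAS idea concrete: a single k×k×k×k 4-D kernel over (s, t, x, y) is replaced by a k×k spatial convolution applied to each sub-aperture image, followed by a k×k angular convolution applied across views at each spatial location. The toy single-channel NumPy sketch below (random light field, kernel size 3, valid padding, averaging kernels — all illustrative assumptions, not the authors' implementation) shows the reduction in weights from k⁴ to 2k²:

```python
import numpy as np

# Hypothetical sketch (not the paper's code): compare the weight count of a
# full 4-D convolution over (s, t, x, y) with its spatial-angular separable
# (SAS) approximation: a 2-D spatial pass followed by a 2-D angular pass.
k = 3                     # kernel width per dimension (assumed)
params_4d = k ** 4        # one 4-D kernel: k*k*k*k weights
params_sas = 2 * k ** 2   # one spatial + one angular 2-D kernel

def sas_conv(lf, w_spatial, w_angular):
    """Single-channel SAS convolution on a 4-D light field lf[s, t, x, y]
    (valid padding, stride 1, plain loops for clarity)."""
    S, T, X, Y = lf.shape
    k1, k2 = w_spatial.shape[0], w_angular.shape[0]
    # Spatial pass: 2-D convolution on each sub-aperture image (SAI).
    spat = np.empty((S, T, X - k1 + 1, Y - k1 + 1))
    for s in range(S):
        for t in range(T):
            for x in range(X - k1 + 1):
                for y in range(Y - k1 + 1):
                    spat[s, t, x, y] = np.sum(lf[s, t, x:x+k1, y:y+k1] * w_spatial)
    # Angular pass: 2-D convolution across views at each spatial location.
    S2, T2 = S - k2 + 1, T - k2 + 1
    out = np.empty((S2, T2) + spat.shape[2:])
    for x in range(spat.shape[2]):
        for y in range(spat.shape[3]):
            for s in range(S2):
                for t in range(T2):
                    out[s, t, x, y] = np.sum(spat[s:s+k2, t:t+k2, x, y] * w_angular)
    return out

lf = np.random.rand(5, 5, 8, 8)          # toy LF: 5x5 views of 8x8 pixels
out = sas_conv(lf, np.ones((3, 3)) / 9, np.ones((3, 3)) / 9)
print(params_4d, params_sas, out.shape)  # 81 vs 18 weights; (3, 3, 6, 6)
```

With these averaging kernels, each output value equals the mean over the corresponding 3×3×3×3 window, i.e., exactly what the rank-1 (outer-product) 4-D kernel would compute; general 4-D kernels are only approximated by such a factorization, which is the efficiency/quality trade-off described above.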
Extensive experimental results on 57 test LF images with various challenging natural scenes show significant advantages of the proposed models over state-of-the-art methods: an average PSNR gain of more than 3.0 dB and better visual quality are achieved, and our methods better preserve the LF structure of the super-resolved LF images, which is highly desirable for subsequent applications. In addition, the SAS convolution-based model achieves a 3× speed-up with only a negligible decrease in reconstruction quality compared with the 4-D convolution-based one. The source code of our method is available online at https://github.com/spatialsr/DeepLightFieldSSR.

Index Terms—Light field, super-resolution, convolutional neural networks

I. INTRODUCTION

As a promising technology for capturing the real world in a more immersive manner, light field (LF) imaging [1] not only records the accumulated intensity at each image point (i.e., spatial information), but also separates the intensity values for each ray direction (i.e., angular information). As a result, the resulting LF image implicitly encodes the three-dimensional (3-D) geometry information of the scene, which facilitates a wide range of applications, such as image post-refocus [2], depth inference [3], [4], 3-D reconstruction [5], and virtual/augmented reality [6], to name just a few. In particular, recent advances in commercial hand-held light field cameras, e.g., Lytro Illum [7] and Raytrix [8], open up the possibility of conveniently acquiring LF images, making research in LF image processing increasingly popular. We refer the readers to [9] for a comprehensive survey on LF imaging and processing.

This work was supported in part by the CityU Start-up Grant for New Faculty under Grant 7200537/CS, in part by the Hong Kong RGC Early Career Scheme Funds 9048123 (CityU 21211518), and in part by the Natural Science Foundation of China under Grant 61873142. (H. Yeung and J. Hou contributed equally to this paper.) (Corresponding authors: J. Hou, X. Chen, and Z. Chen.)
H. Yeung and Y. Y. Chung are with the School of Information Technologies, The University of Sydney, NSW, Australia. (Email: hyeu8081@uni.sydney.edu.au and vera.chung@sydney.edu.au.)
J. Hou is with the Department of Computer Science, City University of Hong Kong, Hong Kong, and is also with the City University of Hong Kong Shenzhen Research Institute, Shenzhen, 51800, China. (Email: jh.hou@cityu.edu.hk.)
X. Chen is with the Institute of Advanced Technology, University of Science and Technology of China, China. (Email: xiaoming.chen@iat.ustc.edu.cn.)
J. Chen is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, 639798. (Email: chen.jie@ntu.edu.sg.)
Z. Chen is with the School of Information Science and Technology, University of Science and Technology of China, China. (Email: chenzhibo@ustc.edu.cn.)

[Fig. 1 graphic: object, camera main lens (angular plane), micro-lens array (spatial plane), and camera sensor output (lenslet image), with panels (a)-(c) showing the conversion from micro-images to decoded SAIs.]
Fig. 1. Schematic of micro-lens based LF imaging. The angular resolution of an LF image (i.e., the number of decoded SAIs) is related to the number of sensor pixels located behind each micro-lens, while the spatial resolution of an LF image (i.e., the resolution of each decoded SAI) is determined by the number of micro-lenses.

A. Representation of LF and Motivation

As illustrated in Fig. 1(a), a four-dimensional (4-D) LF can be represented with the commonly used two-plane parameterization, in which each light ray travels through and intersects the angular plane (s, t) and the spatial plane (x, y). An LF camera can be constructed by inserting a micro-lens array in front of the sensor plane of a conventional camera [10], [11]. The micro-lenses diverge the focused light onto the camera sensor. The output data of the sensor is known as a lenslet image, shown in Fig.
1(b), which is composed of many micro-images, one formed under each micro-lens. The lenslet image can be further converted into multiple sub-aperture images (SAIs), shown in Fig. 1(c), which capture the same target scene from slightly different viewing directions. For a micro-lens based hand-held LF camera, e.g., the Lytro Illum, the angular resolution (or the number of decoded SAIs) is related to the number of sensor pixels located behind each micro-lens, while the spatial resolution (i.e., the resolution of