LEARNING IMAGE AESTHETICS BY LEARNING INPAINTING

June Hao Ching, John See, Lai-Kuan Wong
Visual Processing Lab, Faculty of Computing and Informatics, Multimedia University, Cyberjaya, Malaysia

ABSTRACT

Owing to their capability of learning robust features, convolutional neural networks (CNN) have become a mainstay solution for many computer vision problems, including aesthetic quality assessment (AQA). However, learning with CNNs requires time-consuming and expensive data annotation, especially for a task like AQA. In this paper, we present a novel approach to AQA that incorporates self-supervised learning (SSL) by learning how to inpaint images according to photographic rules such as the rule of thirds and visual saliency. We conduct extensive quantitative experiments on a variety of pretext tasks and on different ways of masking patches for inpainting, reporting fairer distribution-based metrics. We also show the suitability and practicality of the inpainting task, which yields comparably good benchmark results with much lighter model complexity.

Index Terms— Aesthetic quality assessment, CNN, self-supervised learning, image inpainting, photographic rules

1. INTRODUCTION

With the advancement of mobile camera technology and the growth of social media, online photo sharing has become an increasingly popular phenomenon. As such, personal galleries and media retrieval systems are inundated with a massive deluge of images, many of which may be of poor quality or lacking in appeal. The growing interest in aesthetic quality assessment (AQA) in recent years [1, 2, 3] is a testament to the need to automate the process of selecting or sorting images from the perspective of aesthetic appeal.
In the early days, most works proposed for AQA designed hand-crafted features corresponding to known aesthetic principles, such as low-level features based on photographic rules [4], and SIFT or color descriptors [5]. With the success of deep learning, researchers turned to CNN-based models [6, 7, 3, 1, 8], which easily outperform the hand-crafted methods by a significant margin.

Fig. 1. Image inpainting according to photographic rules as a self-supervised learning (SSL) pretext task for AQA.

Although AQA methods based on deep learning outperform most traditional feature extraction methods, the initial data collection and annotation effort is essential to the success of a heavily supervised method like a CNN. This is particularly challenging and expensive for a subjective task like AQA, as opinions need to be collected from many professional photographers to provide useful ratings of the aesthetics of an image. Self-supervised learning (SSL) offers a new paradigm: visual features are first learned from an unlabeled dataset on a pretext task (with pseudo labels) before being transferred to the (actual) downstream supervised prediction task. In recent years, works such as [9, 10, 11, 12, 13] proposed different SSL pretext tasks trained on ImageNet [14] and reported strong capabilities on various downstream tasks. Hence, we are motivated to design a viable SSL pretext task that incorporates photographic rules to better understand image aesthetics. By teaching the machine to inpaint portions of the image that correspond closely to aesthetic concepts, we hypothesize that the model will also learn intrinsic knowledge and features of these concepts, which in turn allows it to perform the AQA task well.

In this paper, we propose a novel approach to AQA that incorporates SSL based on image inpainting. The main contributions of this work are as follows:

1. We propose new ways of performing image inpainting based on compositional rules (rule of thirds, visual saliency) as a self-supervising pretext task for the CNN before transferring to the downstream supervised AQA task.
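This section does not spell out how the compositional masking is implemented; as a minimal sketch under our own assumptions (a square patch centered on one of the four rule-of-thirds intersections, NumPy arrays, and a hypothetical helper name `mask_rule_of_thirds_patch`), the idea of erasing an aesthetically meaningful region for the inpainting pretext task might look like:

```python
import numpy as np

def mask_rule_of_thirds_patch(img, patch_frac=0.25, point=0):
    """Zero out a square patch centered on one of the four
    rule-of-thirds intersections (point in 0..3).

    img: H x W x C float array in [0, 1].
    Returns (masked_img, target_patch, bbox); the target patch is the
    ground truth the inpainting network is trained to reconstruct.
    """
    h, w = img.shape[:2]
    # The four rule-of-thirds intersections lie at 1/3 and 2/3
    # of the image height and width.
    ys = [h // 3, 2 * h // 3]
    xs = [w // 3, 2 * w // 3]
    cy, cx = ys[point // 2], xs[point % 2]
    half = int(min(h, w) * patch_frac) // 2  # half the patch side
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    target = img[y0:y1, x0:x1].copy()  # reconstruction target
    masked = img.copy()
    masked[y0:y1, x0:x1] = 0.0         # erased region to be inpainted
    return masked, target, (y0, y1, x0, x1)
```

A saliency-guided variant would follow the same pattern, centering the erased patch on the peak of a saliency map instead of a fixed thirds intersection.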