Disregarding the Big Picture: Towards Local Image Quality Assessment

Oliver Wiedemann, Vlad Hosu, Hanhe Lin and Dietmar Saupe
Department of Computer and Information Science, University of Konstanz, Germany
Email: {oliver.wiedemann, vlad.hosu, hanhe.lin, dietmar.saupe}@uni-konstanz.de

Abstract—Image quality has been studied almost exclusively as a global image property. It is common practice for IQA databases and metrics to quantify this abstract concept with a single number per image. We propose an approach to blind IQA based on a convolutional neural network (patchnet) that was trained on a novel set of 32,000 individually annotated patches of 64×64 pixels. We use this model to generate spatially small local quality maps of images taken from KonIQ-10k, a large and diverse in-the-wild database of authentically distorted images. We show that our local quality indicator correlates well with global MOS, going beyond the predictive ability of quality-related attributes such as sharpness. Averaging of patchnet predictions already outperforms classical approaches to global MOS prediction that were trained to include global image features. We additionally experiment with a generic second-stage aggregation CNN to estimate mean opinion scores. This latter model performs comparably to the state of the art with a PLCC of 0.81 on KonIQ-10k.

I. INTRODUCTION

Digital images pass through an intricate processing pipeline from being captured to being presented to a human observer. Flaws and limitations of the endpoint devices and performance trade-offs in the algorithms used for transport and storage (e.g. compression) may result in a reduced perceived visual quality. Accurate and generally valid objective image quality assessment (IQA) methods have numerous applications in the multimedia domain since manual inspection is costly and time-consuming.
For example, media outlets and graphic design companies can simplify their search for usable source materials by filtering for content of sufficient quality, service providers can measure the performance of their products or mitigate ongoing problems with respect to content quality, etc.

Subjective studies are known to yield reliable opinions both for artificially distorted image datasets [1], [2], where the severity of particular degradations is known, and for in-the-wild collections of images [3], [4] with authentic and unknown mixtures of distortions. The common benchmark for objective quality measures is their ability to estimate mean opinion scores (MOS) acquired from a sufficiently large number of observers [5].

It is possible to distinguish objective IQA methods by their requirements regarding additional information besides the image under assessment. Full-reference methods, such as the PSNR, need access to a pristine original. Reduced-reference algorithms only require partial information, e.g. the type of the predominant distortion in the given image. No-reference image quality assessment (NR-IQA) methods do not require additional information. In this paper, we introduce an approach to local NR-IQA that applies to the wide range of distortions present in in-the-wild images.

Quality is generally considered a property of the entire image, evaluated via the MOS of a group of observers. This is the point of view that previous IQA methods have taken. Some works consider that each part of the image contributes independently [6] to the overall quality score, whereas others assign different weights [7] to build a better global quality estimate. We hypothesize that quality can be understood both as a local property of a sufficiently large image patch and as a property of an entire image. In our IQA approach, we intend to rely on the assessed quality of individual patches.
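The two aggregation views mentioned above, equal contribution of every patch versus weighted contribution, can be illustrated with a minimal sketch. It is not the method of [6] or [7], and it is not patchnet: the function names, the pure-Python image representation and the gradient-based `toy_patch_score` are assumptions made purely for illustration, standing in for any learned patch-level quality predictor.

```python
from typing import Callable, List, Optional, Sequence

# Assumption: a grayscale image is a list of rows of intensities in [0, 1].
Image = List[List[float]]


def extract_patches(image: Image, size: int, stride: int) -> List[Image]:
    """Slide a size x size window over the image; return patches in row-major order."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patches.append([row[left:left + size] for row in image[top:top + size]])
    return patches


def toy_patch_score(patch: Image) -> float:
    """Hypothetical stand-in for a learned patch predictor.

    Returns the mean absolute horizontal gradient, a crude sharpness proxy.
    """
    total, count = 0.0, 0
    for row in patch:
        for a, b in zip(row, row[1:]):
            total += abs(b - a)
            count += 1
    return total / count if count else 0.0


def aggregate(scores: Sequence[float],
              weights: Optional[Sequence[float]] = None) -> float:
    """Combine patch scores into one global estimate.

    With weights=None every patch contributes equally (plain mean);
    otherwise a normalized weighted mean is returned.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


def global_estimate(image: Image, size: int, stride: int,
                    score_fn: Callable[[Image], float] = toy_patch_score,
                    weights: Optional[Sequence[float]] = None) -> float:
    """Score every patch with score_fn and aggregate the results."""
    scores = [score_fn(p) for p in extract_patches(image, size, stride)]
    return aggregate(scores, weights)


if __name__ == "__main__":
    # 4x4 toy image: flat left half, one vertical edge in the right half.
    demo = [[0.0, 0.0, 0.0, 1.0] for _ in range(4)]
    print(global_estimate(demo, size=2, stride=2))
    print(global_estimate(demo, size=2, stride=2, weights=[1, 3, 1, 3]))
```

In this sketch the weights are supplied by the caller; in practice they might come from a saliency or importance estimate, which is outside the scope of the example.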
To this end, we created a novel dataset of manually quality-annotated RGB patches sampled from KonIQ-10k [4]. We build a local patch-level quality prediction CNN architecture and train it on our patch dataset. To the best of our knowledge, we are the first to directly predict the quality of individual patches, without making any indirect assumptions about the correspondence between the global and local quality scores. We expect our predictor to be more representative of the low-level technical aspects of quality, without having been influenced by content or other higher-level factors, such as aesthetics or image composition.

For comparison, we also include two approaches to global MOS prediction: Patchnet is used in a sliding-window fashion to create spatially small quality maps of authentically distorted images taken from KonIQ-10k. The average value of these maps already correlates highly with the global MOS. Furthermore, we augment our quality maps with two other local low-level indicators, namely the FISH sharpness metric [8] and brightness information in the form of a gray-scale version of the original input. We then study the performance of a generic feature aggregator based on a DenseNet-169 CNN [9].

Our results show that the correlations between the mean values of patchnet quality maps and global MOS values on KonIQ-10k are already comparable to the best-performing global statistical methods that were fine-tuned on the respective dataset. Aggregation of all three of our spatially small feature maps by a second-stage CNN outperforms all classical methods and the naive patch-based deep learning methods. We expected this approach to fall short of the global performance of models that traded incorporating additional information (e.g. content) for the ability to predict local quality