Multiple Hypothesis Colorization and Its Application to Image Compression

Mohammad Haris Baig a,*, Lorenzo Torresani a
a Hanover, New Hampshire, United States

Abstract

In this work we focus on the problem of colorization for image compression. Since color information occupies a large proportion of the total storage size of an image, a method that can predict accurate color from its grayscale version can produce a dramatic reduction in image file size. But colorization for compression poses several challenges. First, while colorization for artistic purposes simply involves predicting plausible chroma, colorization for compression requires generating output colors that are as close as possible to the ground truth. Second, many objects in the real world exhibit multiple possible colors. Thus, in order to disambiguate the colorization problem, some additional information must be stored to reproduce the true colors with good accuracy. To account for the multimodal color distribution of objects, we propose a deep tree-structured network that generates for every pixel multiple color hypotheses, as opposed to the single color produced by most prior colorization approaches. We show how to leverage the multimodal output of our model to reproduce with high fidelity the true colors of an image by storing very little additional information. In the experiments we show that our proposed method outperforms traditional JPEG color coding by a large margin, producing colors that are nearly indistinguishable from the ground truth at the storage cost of just a few hundred bytes for high-resolution pictures!

Keywords: Colorization, Deep Learning, Image Compression

1. Introduction

Learning to colorize grayscale images is an important task for three main reasons. First, in order to predict the appropriate chroma of objects in an image, a colorization model effectively learns to perform high-level understanding from unlabeled color images.
In other words, it learns to recognize the spatial extents and the prototypical colors of semantic segments in the picture. Since unlabeled photos are plentiful, colorization can be used as an unsupervised pre-training mechanism for subsequent supervised learning of high-level models for which labeled data may be scarce. Second, colorization can be useful for artistic pursuits, by giving new life to vintage grayscale photos and old footage. Finally, colorization models can greatly help with image and video compression. Most objects cannot have all possible colors, and by learning the plausible color space for each object we can encode the color information more compactly. In this work we focus predominantly on this last application of colorization, by learning parametric models of image colorization for image compression. We outline the challenges posed by colorization for image compression and propose a new deep architecture to overcome these hurdles.

Recent successful learning-based approaches [2, 3] for automatic colorization operate under the regime of “zero-cost,” i.e., they assume that the output color must be predicted from the input grayscale image without any additional storage expense. While this may be reasonable for generating artistic colorizations automatically, it is not applicable to image compression, as many objects in the real world admit multiple plausible colors. The problem is exemplified in Table 6, where we report zero-cost colorization results for different methods as well as our approach. While some of the colors produced by these methods are realistic-looking, they are actually quite different from the ground truth (first column).

* Corresponding author. Email addresses: haris@cs.dartmouth.edu (Mohammad Haris Baig), LT@dartmouth.edu (Lorenzo Torresani)
In order to reproduce the true colors with high fidelity, we propose to store some additional information that helps to disambiguate between the choices (last column of Table 6).

To account for the multimodal color distribution of many objects, we propose a convolutional neural network (CNN) that takes as input a grayscale photo and outputs K plausible color values per image pixel, where K is treated as a hyper-parameter defining the complexity of the model. The multiple outputs are produced by using a CNN structured in the form of a tree, with a single trunk splitting at a given depth into K branches, each generating a candidate color per pixel. The trunk contains convolutional layers that compute shared features utilized by all branches, while each individual branch predicts a distinct plausible color mode for each pixel. We study how to use this architecture both in the zero-cost setting and for compression. In the zero-cost setting, we train the network to choose one of its K candidate outputs at

Preprint submitted to Elsevier, March 3, 2017
arXiv:1606.06314v3 [cs.CV] 1 Mar 2017
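To make the compression use of the K hypotheses concrete, the disambiguation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `select_hypotheses`, the per-block selection granularity, and the (H, W, 2) ab-chroma layout are all hypothetical. The side information stored per block is simply the index of the branch whose predicted chroma is closest to the ground truth, at ceil(log2 K) bits per block.

```python
import numpy as np

def select_hypotheses(candidates, ground_truth, block=16):
    """Pick, for each block of pixels, the best of K candidate chroma maps.

    candidates:   (K, H, W, 2) array of candidate ab-chroma predictions,
                  one map per branch of the tree-structured network.
    ground_truth: (H, W, 2) true ab-chroma of the image.
    Returns (indices, bits): per-block branch indices and the number of
    bits of side information needed to store them.
    """
    K, H, W, _ = candidates.shape
    # Per-pixel squared chroma error for each of the K hypotheses: (K, H, W).
    err = ((candidates - ground_truth[None]) ** 2).sum(-1)
    # Aggregate the error over non-overlapping block x block tiles.
    bh, bw = H // block, W // block
    block_err = (err[:, :bh * block, :bw * block]
                 .reshape(K, bh, block, bw, block)
                 .sum(axis=(2, 4)))                 # (K, bh, bw)
    indices = block_err.argmin(axis=0)              # best branch per block
    # Side information: one branch index per block, ceil(log2 K) bits each.
    bits = indices.size * int(np.ceil(np.log2(K)))
    return indices, bits
```

For a 512x512 image with K = 4 and 16x16 blocks, this amounts to 32 x 32 = 1024 block indices at 2 bits each, i.e., 256 bytes of side information, consistent in order of magnitude with the "few hundred bytes" figure quoted in the abstract.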