Learning Convolutional Proximal Filters

Ulugbek S. Kamilov, Hassan Mansour, and Dehong Liu
Mitsubishi Electric Research Laboratories (MERL)
201 Broadway, Cambridge, MA, 02139, USA
Email: kamilov@merl.com, mansour@merl.com, and liudh@merl.com

Abstract—In the past decade, sparsity-driven methods have led to substantial improvements in the capabilities of numerous imaging systems. While traditionally such methods relied on analytical models of sparsity, such as total variation (TV) or wavelet regularization, recent methods are increasingly based on data-driven models such as dictionary learning or convolutional neural networks (CNNs). In this work, we propose a new trainable model based on the proximal operator for TV. By interpreting the popular fast iterative shrinkage/thresholding algorithm (FISTA) as a CNN, we train the filters of the algorithm to minimize the error over a training dataset. Experiments on image denoising show that by training the filters, one can substantially boost the performance of the algorithm and make it competitive with other state-of-the-art methods.

I. INTRODUCTION

We consider an imaging inverse problem y = Hx + e, where the goal is to recover the unknown image x ∈ R^N from the noisy measurements y ∈ R^M. The matrix H ∈ R^{M×N} is known and models the response of the acquisition device, while the vector e ∈ R^M represents the unknown noise in the measurements. Practical imaging inverse problems are often ill-posed [1]. A standard approach for solving such problems is the regularized least-squares estimator

    x̂ = arg min_{x ∈ R^N} { (1/2)‖y − Hx‖²_{ℓ2} + R(x) },    (1)

where R is a regularizer promoting solutions with desirable properties. One of the most popular regularizers for images is total variation (TV) [2], defined as R(x) ≜ τ‖Dx‖_{ℓ1}, where τ > 0 is a parameter that controls the strength of the regularization and D : R^N → R^{N×K} is the discrete gradient operator.
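To make the notation concrete, here is a minimal sketch of the discrete gradient D and the anisotropic TV regularizer for a 2-D image (K = 2, one finite-difference filter per dimension). The periodic boundary handling and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def grad2d(x):
    """Discrete gradient D: K = 2 finite-difference filters, one per
    image dimension (periodic boundary assumed for simplicity)."""
    dh = np.roll(x, -1, axis=1) - x  # horizontal differences
    dv = np.roll(x, -1, axis=0) - x  # vertical differences
    return np.stack([dh, dv])

def tv(x, tau):
    """Anisotropic TV regularizer R(x) = tau * ||Dx||_1."""
    return tau * np.abs(grad2d(x)).sum()
```

On a constant image both difference channels vanish, so R(x) = 0, which is why TV favors piecewise-constant reconstructions.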
The gradient can be represented with K separate filters, D = (D_1, …, D_K), computing finite differences along each dimension of the image.

Two common methods for solving the TV-regularized problem (1) are the fast iterative shrinkage/thresholding algorithm (FISTA) [3] and the alternating direction method of multipliers (ADMM) [4]. These algorithms are among the methods of choice for solving large-scale imaging problems due to their ability to handle the non-smoothness of TV and their low computational complexity. Both FISTA and ADMM typically combine operations with the measurement matrix with applications of the proximal operator

    prox_{τR}(y) ≜ arg min_{x ∈ R^N} { (1/2)‖x − y‖²_{ℓ2} + τR(x) }.    (2)

Beck and Teboulle [3] have proposed an efficient dual-domain FISTA for computing the TV proximal

    s^t = g^{t−1} + ((q_{t−1} − 1)/q_t)(g^{t−1} − g^{t−2})    (3a)
    z^t = s^t − γτ D(τ Dᵀs^t − y)    (3b)
    g^t = P_∞(z^t),    (3c)

with q_0 = 1 and g^0 = g^{−1} = g_init ∈ R^{N×K}. Here, P_∞ denotes a component-wise projection operator onto the unit ℓ_∞-norm ball, γ = 1/L with L = τ² λ_max(DᵀD) is a step size, and {q_t}_{t ∈ N} are relaxation parameters. For a fixed q_t = 1, the guaranteed global convergence rate of the algorithm is O(1/t); however, the choice q_t = (1/2)(1 + √(1 + 4q²_{t−1})) leads to a faster O(1/t²) convergence [3]. The final denoised image after T iterations of (3) is obtained as x̂ = y − τ Dᵀg^T.

II. MAIN RESULTS

Our goal is to obtain a trainable variant of (3) by replacing the finite-difference filters of TV with K adaptable, iteration-dependent filters. The corresponding algorithm, illustrated in Fig. 1, can be interpreted as a convolutional neural network (CNN) of a particular structure with T × K filters D_t = (D_{t1}, …, D_{tK}) that are learned from a set of L training examples {x_ℓ, y_ℓ}_{ℓ ∈ [1,…,L]}. The filters can be optimized by minimizing the error

    θ̂ = arg min_{θ ∈ Θ} (1/L) Σ_{ℓ=1}^{L} E_ℓ(θ),  with  E_ℓ(θ) ≜ ‖x_ℓ − x̂(y_ℓ; θ)‖²_{ℓ2},    (4)

over the training set, where θ = {D_t}_{t ∈ [1,…,T]} ∈ Θ denotes the set of desirable filters.
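The dual iteration (3) can be rendered directly in a few lines. The sketch below is a minimal NumPy version for a 2-D image, assuming periodic finite differences for D and the standard bound λ_max(DᵀD) ≤ 8 for the step size; the helper names are illustrative, not from the paper.

```python
import numpy as np

def grad2d(x):
    # D: forward finite differences with periodic boundary (K = 2)
    return np.stack([np.roll(x, -1, axis=1) - x, np.roll(x, -1, axis=0) - x])

def grad2d_adj(g):
    # D^T: exact adjoint of grad2d under the periodic boundary
    dh, dv = g
    return (np.roll(dh, 1, axis=1) - dh) + (np.roll(dv, 1, axis=0) - dv)

def tv_prox(y, tau, T=100):
    """Dual-domain FISTA (3) for prox_{tau R} with R(x) = tau * ||Dx||_1."""
    g = np.zeros((2,) + y.shape)        # g^0
    g_prev = g.copy()                   # g^{-1}
    q = 1.0                             # q_0
    gamma = 1.0 / (8.0 * tau**2)        # 1/L, using lambda_max(D^T D) <= 8
    for _ in range(T):
        q_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * q**2))
        s = g + ((q - 1.0) / q_next) * (g - g_prev)             # (3a)
        z = s - gamma * tau * grad2d(tau * grad2d_adj(s) - y)   # (3b)
        g_prev, g = g, np.clip(z, -1.0, 1.0)                    # (3c): P_inf
        q = q_next
    return y - tau * grad2d_adj(g)      # x_hat = y - tau * D^T g^T
```

Since the dual variable g lives on a box, the projection P_∞ is just a component-wise clip, which is what makes this dual formulation so cheap per iteration.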
For the problem of image denoising, end-to-end optimization can be performed with the error backpropagation algorithm [5], which produces

    [∇E_ℓ(θ)]_{tk} = q^{tk} + τ(g^T_k • (x̂ − x))   for t = T,
    [∇E_ℓ(θ)]_{tk} = q^{tk}                        for 1 ≤ t ≤ T − 1,

using the following iteration for t = T, T − 1, …, 1,

    v^{t−1} = diag(P′_∞(z^t)) r^t    (5a)
    b^{t−1} = v^{t−1} − γτ² D_t Dᵀ_t v^{t−1}    (5b)
    r^{t−1} = μ_t b^{t−1} + (1 − μ_{t+1}) b^t    (5c)
    q^{tk} = γτ[(v^{t−1}_k • (y − τ Dᵀ_t s^t)) − τ(s^t_k • (Dᵀ_t v^{t−1}))],    (5d)

where • denotes filtering, μ_t = 1 − (1 − q_{t−1})/q_t, b^T = 0, and r^T = τ D_T(x̂ − x). The parameters are updated iteratively with the standard stochastic gradient method as θ ← θ − α∇E_ℓ(θ).

We applied our method to image denoising by training T = 10 iterations of the algorithm with K = 9 iteration-dependent kernels of size 6 × 6 pixels. For training, we used 400 images from the Berkeley dataset [6] cropped to 192 × 192 pixels. We evaluated the algorithm on 68 separate test images from the dataset and compared the results with three popular denoising algorithms (see Table I and Fig. 2–3). Our basic MATLAB implementation takes 0.69 and 3.27 seconds on images of 256 × 256 and 512 × 512 pixels, respectively, on an Apple iMac with a 4 GHz Intel Core i7 processor.

We observe that our simple extension of TV significantly boosts the performance of the algorithm and makes it competitive with state-of-the-art denoising algorithms. The algorithm can easily be incorporated into FISTA and ADMM for solving more general inverse problems. Future work will address such extensions and further improve the performance through code optimization and by considering more kernels. More generally, our work contributes to the recent efforts to boost the performance of imaging algorithms by incorporating the latest ideas from deep learning [7]–[13].
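For intuition, the unrolled forward pass x̂(y; θ) can be sketched by replacing the finite-difference filters in (3) with arbitrary per-iteration kernels D_t. This is only an illustrative sketch: the periodic ('wrap') filtering, the kernel shapes, and the conservative step-size bound λ_max(DᵀD) ≤ Σ_k ‖k‖₁² are assumptions made here for simplicity; the backward pass follows (5) and is omitted.

```python
import numpy as np
from scipy.ndimage import convolve, correlate

def apply_D(x, kernels):
    # D_t x: filter the image with each of the K kernels (periodic boundary)
    return np.stack([convolve(x, k, mode='wrap') for k in kernels])

def apply_Dt(g, kernels):
    # D_t^T g: adjoint filtering (correlation) of each channel, summed
    return sum(correlate(gk, k, mode='wrap') for gk, k in zip(g, kernels))

def unrolled_forward(y, theta, tau):
    """Forward pass of the unrolled network: iterations of (3) with
    iteration-dependent kernel stacks theta[t] of shape (K, h, w)."""
    K = theta[0].shape[0]
    g = np.zeros((K,) + y.shape)
    g_prev = g.copy()
    q = 1.0
    for kernels in theta:
        # step size from the assumed bound lambda_max(D^T D) <= sum_k ||k||_1^2
        L = tau**2 * max(sum(np.abs(k).sum() ** 2 for k in kernels), 1e-12)
        gamma = 1.0 / L
        q_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * q**2))
        s = g + ((q - 1.0) / q_next) * (g - g_prev)
        z = s - gamma * tau * apply_D(tau * apply_Dt(s, kernels) - y, kernels)
        g_prev, g = g, np.clip(z, -1.0, 1.0)
        q = q_next
    return y - tau * apply_Dt(g, theta[-1])
```

With finite-difference kernels this reduces exactly to the TV proximal above; training replaces those kernels with learned ones while keeping the same fixed-depth computation graph.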