Learning Convolutional Proximal Filters

Ulugbek S. Kamilov, Hassan Mansour, and Dehong Liu
Mitsubishi Electric Research Laboratories (MERL)
201 Broadway, Cambridge, MA, 02139, USA
Email: kamilov@merl.com, mansour@merl.com, and liudh@merl.com

Abstract—In the past decade, sparsity-driven methods have led to substantial improvements in the capabilities of numerous imaging systems. While traditionally such methods relied on analytical models of sparsity, such as total variation (TV) or wavelet regularization, recent methods are increasingly based on data-driven models such as dictionary learning or convolutional neural networks (CNNs). In this work, we propose a new trainable model based on the proximal operator for TV. By interpreting the popular fast iterative shrinkage/thresholding algorithm (FISTA) as a CNN, we train the filters of the algorithm to minimize the error over a training dataset. Experiments on image denoising show that by training the filters, one can substantially boost the performance of the algorithm and make it competitive with other state-of-the-art methods.

I. INTRODUCTION

We consider an imaging inverse problem y = Hx + e, where the goal is to recover the unknown image x ∈ R^N from the noisy measurements y ∈ R^M. The matrix H ∈ R^{M×N} is known and models the response of the acquisition device, while the vector e ∈ R^M represents the unknown noise in the measurements. Practical imaging inverse problems are often ill-posed [1]. A standard approach for solving such problems is the regularized least-squares estimator

    x̂ = arg min_{x ∈ R^N} { (1/2)‖y − Hx‖²_{ℓ2} + R(x) },    (1)

where R is a regularizer promoting solutions with desirable properties. One of the most popular regularizers for images is total variation (TV) [2], defined as R(x) ≜ τ‖Dx‖_{ℓ1}, where τ > 0 is a parameter that controls the strength of the regularization and D : R^N → R^{N×K} is the discrete gradient operator.
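To make the notation concrete, here is a minimal sketch of the discrete gradient D and the anisotropic TV regularizer for a 2-D image (K = 2, one finite-difference filter per dimension). The periodic boundary handling and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def grad2d(x):
    """Discrete gradient D: K = 2 finite-difference filters, one per
    image dimension (periodic boundary assumed for simplicity)."""
    dh = np.roll(x, -1, axis=1) - x  # horizontal differences
    dv = np.roll(x, -1, axis=0) - x  # vertical differences
    return np.stack([dh, dv])

def tv(x, tau):
    """Anisotropic TV regularizer R(x) = tau * ||Dx||_1."""
    return tau * np.abs(grad2d(x)).sum()
```

On a constant image both difference channels vanish, so R(x) = 0, which is why TV favors piecewise-constant reconstructions.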
The gradient can be represented with K separate filters, D = (D_1, …, D_K), computing finite differences along each dimension of the image.

Two common methods for solving the TV-regularized problem (1) are the fast iterative shrinkage/thresholding algorithm (FISTA) [3] and the alternating direction method of multipliers (ADMM) [4]. These algorithms are among the methods of choice for solving large-scale imaging problems due to their ability to handle the non-smoothness of TV and their low computational complexity. Both FISTA and ADMM typically combine operations with the measurement matrix with applications of the proximal operator

    prox_{τR}(y) ≜ arg min_{x ∈ R^N} { (1/2)‖x − y‖²_{ℓ2} + τR(x) }.    (2)

Beck and Teboulle [3] have proposed an efficient dual-domain FISTA for computing the TV proximal

    s^t = g^{t−1} + ((q_{t−1} − 1)/q_t)(g^{t−1} − g^{t−2})    (3a)
    z^t = s^t − γτ D(τ Dᵀs^t − y)    (3b)
    g^t = P_∞(z^t),    (3c)

with q_0 = 1 and g^0 = g^{−1} = g_init ∈ R^{N×K}. Here, P_∞ denotes a component-wise projection operator onto the unit ℓ_∞-norm ball, γ = 1/L with L = τ² λ_max(DᵀD) is a step size, and {q_t}_{t ∈ N} are relaxation parameters. For a fixed q_t = 1, the guaranteed global convergence rate of the algorithm is O(1/t); however, the choice q_t = (1/2)(1 + √(1 + 4q²_{t−1})) leads to a faster O(1/t²) convergence [3]. The final denoised image after T iterations of (3) is obtained as x̂ = y − τ Dᵀg^T.

II. MAIN RESULTS

Our goal is to obtain a trainable variant of (3) by replacing the finite-difference filters of TV with K adaptable, iteration-dependent filters. The corresponding algorithm, illustrated in Fig. 1, can be interpreted as a convolutional neural network (CNN) of a particular structure with T × K filters D_t = (D_{t1}, …, D_{tK}) that are learned from a set of L training examples {x_ℓ, y_ℓ}_{ℓ ∈ [1,…,L]}. The filters can be optimized by minimizing the error

    θ̂ = arg min_{θ ∈ Θ} (1/L) Σ_{ℓ=1}^{L} E_ℓ(θ),  with  E_ℓ(θ) ≜ ‖x_ℓ − x̂(y_ℓ; θ)‖²_{ℓ2},    (4)

over the training set, where θ = {D_t}_{t ∈ [1,…,T]} ∈ Θ denotes the set of desirable filters.
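The dual iteration (3) can be rendered directly in a few lines. The sketch below is a minimal NumPy version for a 2-D image, assuming periodic finite differences for D and the standard bound λ_max(DᵀD) ≤ 8 for the step size; the helper names are illustrative, not from the paper.

```python
import numpy as np

def grad2d(x):
    # D: forward finite differences with periodic boundary (K = 2)
    return np.stack([np.roll(x, -1, axis=1) - x, np.roll(x, -1, axis=0) - x])

def grad2d_adj(g):
    # D^T: exact adjoint of grad2d under the periodic boundary
    dh, dv = g
    return (np.roll(dh, 1, axis=1) - dh) + (np.roll(dv, 1, axis=0) - dv)

def tv_prox(y, tau, T=100):
    """Dual-domain FISTA (3) for prox_{tau R} with R(x) = tau * ||Dx||_1."""
    g = np.zeros((2,) + y.shape)        # g^0
    g_prev = g.copy()                   # g^{-1}
    q = 1.0                             # q_0
    gamma = 1.0 / (8.0 * tau**2)        # 1/L, using lambda_max(D^T D) <= 8
    for _ in range(T):
        q_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * q**2))
        s = g + ((q - 1.0) / q_next) * (g - g_prev)             # (3a)
        z = s - gamma * tau * grad2d(tau * grad2d_adj(s) - y)   # (3b)
        g_prev, g = g, np.clip(z, -1.0, 1.0)                    # (3c): P_inf
        q = q_next
    return y - tau * grad2d_adj(g)      # x_hat = y - tau * D^T g^T
```

Since the dual variable g lives on a box, the projection P_∞ is just a component-wise clip, which is what makes this dual formulation so cheap per iteration.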
For the problem of image denoising, end-to-end optimization can be performed with the error backpropagation algorithm [5], which produces

    [∇E_ℓ(θ)]_{tk} = q^{tk} + τ(g^T_k • (x̂ − x))   for t = T,
    [∇E_ℓ(θ)]_{tk} = q^{tk}                        for 1 ≤ t ≤ T − 1,

using the following iteration for t = T, T − 1, …, 1,

    v^{t−1} = diag(P′_∞(z^t)) r^t    (5a)
    b^{t−1} = v^{t−1} − γτ² D_t Dᵀ_t v^{t−1}    (5b)
    r^{t−1} = μ_t b^{t−1} + (1 − μ_{t+1}) b^t    (5c)
    q^{tk} = γτ[(v^{t−1}_k • (y − τ Dᵀ_t s^t)) − τ(s^t_k • (Dᵀ_t v^{t−1}))],    (5d)

where • denotes filtering, μ_t = 1 − (1 − q_{t−1})/q_t, b^T = 0, and r^T = τ D_T(x̂ − x). The parameters are updated iteratively with the standard stochastic gradient method as θ ← θ − α∇E_ℓ(θ).

We applied our method to image denoising by training T = 10 iterations of the algorithm with K = 9 iteration-dependent kernels of size 6 × 6 pixels. For training, we used 400 images from the Berkeley dataset [6] cropped to 192 × 192 pixels. We evaluated the algorithm on 68 separate test images from the dataset and compared the results with three popular denoising algorithms (see Table I and Fig. 2–3). Our basic MATLAB implementation takes 0.69 and 3.27 seconds on images of 256 × 256 and 512 × 512 pixels, respectively, on an Apple iMac with a 4 GHz Intel Core i7 processor.

We observe that our simple extension of TV significantly boosts the performance of the algorithm and makes it competitive with state-of-the-art denoising algorithms. The algorithm can easily be incorporated into FISTA and ADMM for solving more general inverse problems. Future work will address such extensions and further improve the performance through code optimization and by considering more kernels. More generally, our work contributes to the recent efforts to boost the performance of imaging algorithms by incorporating the latest ideas from deep learning [7]–[13].
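For intuition, the unrolled forward pass x̂(y; θ) can be sketched by replacing the finite-difference filters in (3) with arbitrary per-iteration kernels D_t. This is only an illustrative sketch: the periodic ('wrap') filtering, the kernel shapes, and the conservative step-size bound λ_max(DᵀD) ≤ Σ_k ‖k‖₁² are assumptions made here for simplicity; the backward pass follows (5) and is omitted.

```python
import numpy as np
from scipy.ndimage import convolve, correlate

def apply_D(x, kernels):
    # D_t x: filter the image with each of the K kernels (periodic boundary)
    return np.stack([convolve(x, k, mode='wrap') for k in kernels])

def apply_Dt(g, kernels):
    # D_t^T g: adjoint filtering (correlation) of each channel, summed
    return sum(correlate(gk, k, mode='wrap') for gk, k in zip(g, kernels))

def unrolled_forward(y, theta, tau):
    """Forward pass of the unrolled network: iterations of (3) with
    iteration-dependent kernel stacks theta[t] of shape (K, h, w)."""
    K = theta[0].shape[0]
    g = np.zeros((K,) + y.shape)
    g_prev = g.copy()
    q = 1.0
    for kernels in theta:
        # step size from the assumed bound lambda_max(D^T D) <= sum_k ||k||_1^2
        L = tau**2 * max(sum(np.abs(k).sum() ** 2 for k in kernels), 1e-12)
        gamma = 1.0 / L
        q_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * q**2))
        s = g + ((q - 1.0) / q_next) * (g - g_prev)
        z = s - gamma * tau * apply_D(tau * apply_Dt(s, kernels) - y, kernels)
        g_prev, g = g, np.clip(z, -1.0, 1.0)
        q = q_next
    return y - tau * apply_Dt(g, theta[-1])
```

With finite-difference kernels this reduces exactly to the TV proximal above; training replaces those kernels with learned ones while keeping the same fixed-depth computation graph.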