IEEE TRANSACTIONS ON IMAGE PROCESSING

WaveCNet: Wavelet Integrated CNNs to Suppress Aliasing Effect for Noise-Robust Image Classification

Qiufu Li, Linlin Shen*, Sheng Guo, Zhihui Lai

Abstract—Though widely used in image classification, convolutional neural networks (CNNs) are prone to noise interruptions, i.e., the CNN output can be drastically changed by small image noise. To improve noise robustness, we integrate CNNs with wavelets by replacing the common down-sampling operations (max-pooling, strided convolution, and average pooling) with the discrete wavelet transform (DWT). We first propose general DWT and inverse DWT (IDWT) layers applicable to various orthogonal and biorthogonal discrete wavelets, such as Haar, Daubechies, and Cohen wavelets, and then design wavelet integrated CNNs (WaveCNets) by integrating DWT into commonly used CNNs (VGG, ResNets, and DenseNet). During down-sampling, WaveCNets apply DWT to decompose the feature maps into low-frequency and high-frequency components. The low-frequency component, which contains the main information including the basic object structures, is transmitted into the following layers to generate robust high-level features. The high-frequency components are dropped to remove most of the data noise. The experimental results show that WaveCNets achieve higher accuracy on ImageNet than various vanilla CNNs. We have also tested the performance of WaveCNets on the noisy version of ImageNet, on ImageNet-C, and under six adversarial attacks; the results suggest that the proposed DWT/IDWT layers provide better noise robustness and adversarial robustness. When WaveCNets are applied as backbones, the performance of object detectors (i.e., Faster R-CNN and RetinaNet) on the COCO detection dataset is consistently improved. We believe that suppression of the aliasing effect, i.e., the separation of low-frequency and high-frequency information, is the main advantage of our approach.
The code of our DWT/IDWT layers and the different WaveCNets is available at https://github.com/CVI-SZU/WaveCNet.

Index Terms—CNN, down-sampling, aliasing effect, wavelet transform layers, noise-robustness, basic object structure.

arXiv:2107.13335v1 [cs.CV] 28 Jul 2021

The work is supported by the Natural Science Foundation of China under grants no. 62006156, 91959108, and U1713214, and the Science and Technology Project of Guangdong Province under grant no. 2018A050501014. Corresponding author: Linlin Shen.

Q. Li, L. Shen, and Z. Lai are with the Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China; the Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen 518060, China; and the Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, China (e-mail: qiufu_li_1988@163.com; llshen@szu.edu.cn; lai_zhi_hui@163.com).

S. Guo is with MyBank, Ant Group, Hangzhou 310012, China (e-mail: guosheng.guosheng@alibaba-inc.com).

Fig. 1: Comparison of max-pooling and wavelet transforms. Max-pooling, a down-sampling operation commonly used in deep networks, can easily break the basic object structures. The discrete wavelet transform (DWT) decomposes an image X into its low-frequency component X_ll and its high-frequency components X_lh, X_hl, and X_hh. While X_lh, X_hl, and X_hh represent image details, including most of the noise, X_ll is a low-resolution version of the image in which the basic object structures are preserved. In the figures, the window boundary in area A (AP) and the poles in area B (BP) are broken by max-pooling, while the principal features of these objects are kept in the DWT output (AW and BW).

I. INTRODUCTION

Small noise, including the common spatial noise [1] and the specially designed adversarial noise [2]–[6], can drastically change the final prediction of a well-trained convolutional neural network (CNN) for image classification. Recent studies [7], [8] show that the noise may be enlarged as the image data flows through the deep networks. These phenomena illustrate the weak noise-robustness of CNNs.

The weak noise-robustness of CNNs is closely related to down-sampling. The down-sampling operations commonly used in deep networks, such as max-pooling, average-pooling, and strided convolution, usually ignore the classic sampling theorem [9], which results in aliasing among the data components in different frequency intervals. While the noise in the data consists mostly of high-frequency components, the low-frequency component contains the main information, such as the basic object structures. Therefore, the aliasing introduces residual noise into the down-sampled data and breaks the basic object structures.
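To make the decomposition concrete, the following is a minimal NumPy sketch of a single-level 2D Haar DWT and its inverse — an illustration only, not the authors' released implementation; the function names `haar_dwt2` and `haar_idwt2` are ours. For the Haar wavelet, each 2x2 block of pixels is combined into one low-frequency coefficient (X_ll) and three high-frequency detail coefficients (X_lh, X_hl, X_hh), each at half the spatial resolution:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of a 2D array with even side lengths.

    Returns the low-frequency component ll (X_ll) and the three
    high-frequency components lh, hl, hh, each at half resolution.
    """
    # The four pixels of each non-overlapping 2x2 block.
    a = x[0::2, 0::2]  # top-left
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-low: half-resolution approximation
    lh = (a - b + c - d) / 2.0  # detail along one axis
    hl = (a + b - c - d) / 2.0  # detail along the other axis
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: exact reconstruction of the input."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x
```

Because the Haar low-pass filter averages adjacent pixels, pixel-level alternating noise (the highest frequency the sampling grid can represent) cancels in X_ll rather than aliasing into the down-sampled output — whereas stride-2 subsampling or max-pooling folds such noise directly into the result. Dropping X_lh, X_hl, and X_hh, as WaveCNets do, therefore removes most of the high-frequency noise while keeping the basic object structures in X_ll.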