Wavelet RCNN: Enhancing Object Detection Accuracy through Spectral Information

Shivangi Nigam, Dept. of Information Technology, IIIT, Allahabad, Prayagraj, UP, India, rsi2018506@iiita.ac.in
Shekhar Verma, Dept. of Information Technology, IIIT, Allahabad, Prayagraj, UP, India, sverma@iiita.ac.in
P. Nagabhushan, Dept. of Information Technology, IIIT, Allahabad, Prayagraj, UP, India, pnagabhushan@iiita.ac.in

Abstract: We propose an object detection model that uses wavelet transforms to address the trade-off between speed and accuracy. The feature extraction network uses wavelet transforms to supply the object detector with spectral information and to reduce the number of parameters in the detection pipeline. The multi-resolution analysis performed by wavelet transforms decomposes an image into high-frequency and low-frequency feature maps. The statistics of the frequency (spectral) information at different scales and orientations define image features that improve the accuracy of object localisation. The idea is to reduce the resolution of feature maps and increase the receptive field size. The sparsity that wavelet transforms induce in feature maps is an efficient way to look for relevant features, significantly reducing pipeline parameters, which helps to improve the speed of the Wavelet RCNN model. To improve the classification performance of the model, traditional ROI pooling is replaced by wavelet pooling. The model is evaluated on the PASCAL VOC dataset, on which it achieves an mAP of 75.7%. Model performance is evaluated using an orthogonal wavelet (Haar) and a biorthogonal wavelet (Bior3.5). Orthogonal wavelets gave a significant increase in speed with a slight increase in accuracy, whereas biorthogonal wavelets gave a significant increase in accuracy with a slight increase in speed.

Index Terms: Object detection, convolutional neural network, feature extraction, wavelet transformation

I. INTRODUCTION

Object detection and recognition involves feature extraction for semantic understanding of images and classification for perceiving their contents. The objective is to precisely classify objects and estimate their locations with bounding boxes. The most researched object detection domains include image classification, human behaviour analysis, face recognition, autonomous driving, pose detection, and scene text detection.

Classical object detection methods had a segregated pipeline with manually engineered techniques for tasks such as feature extraction, region of interest (ROI) selection, and object classification. They were limited by feature representation methods such as SIFT [1], Haar [2], and HOG [3]. The input images were resized to multiple scales, and multiple sliding windows were used to handle objects of different scales and aspect ratios. Classifiers such as Support Vector Machines (SVM) [4] were applied to the proposed ROIs to assign them class labels. Although these object detectors achieved impressive results on small datasets such as PASCAL VOC [5], they had many limitations: handcrafted feature descriptors were limited to the object categories they were trained for; most of the large number of region proposals produced were redundant, which caused an imbalance of proposals; and the segregated detection pipeline required a separate design and optimization strategy for each stage.

With the advent of deep learning techniques, the efficiency and accuracy of object detection systems have seen substantial progress. This advancement has been driven by progress in large-scale parallel computing hardware such as GPUs and by the availability of large datasets (ImageNet [6], PASCAL VOC [5], MS COCO [7]). Deep learning methods are based on neural networks with multiple layers, each with multiple units. Different deep learning architectures have different layering structures.
Of the many deep learning architectures, such as CNNs, RNNs, and GANs, CNNs are the most widely used for object detection tasks. CNNs are also used as backbone networks for feature extraction. Different backbone architectures are used depending on whether the objective favours efficiency or accuracy. ResNet [8] and ResNeXt [9] are used when deep and densely connected networks are required to achieve high precision and accuracy. Lightweight backbones such as MobileNet [10], MobileNetV2 [11], ShuffleNet [12], SqueezeNet [13], and Xception [14] are used for real-time applications. Deep networks require large datasets and high computing power for classification, detection, and recognition. However, the strength of deep networks lies in the fact that they learn features from raw data and do not depend on manually created filters.

Deep learning-based object detection has several advantages over classical detection techniques:
1) Effective high-level feature representation
2) Multi-task learning
3) High learning capacity

Object detection approaches are divided into two-stage and single-stage detectors, between which there is a speed vs. accuracy trade-off [15]. Two-stage detectors provide high-precision bounding boxes because they have a dedicated network for region proposals; in contrast, one-stage detectors have low computational requirements because they fold all computations into a single detection pipeline. Speeding up the process and consequently improving accuracy requires the object detection pipeline to achieve good object