Deep Spatial Pyramid Match Kernel for Scene Classification

Shikha Gupta 1, Deepak Kumar Pradhan 2, Dileep Aroor Dinesh 1 and Veena Thenkanidiyoor 2
1 School of Computing and EE, Indian Institute of Technology, Mandi, H.P., India
2 Department of CSE, National Institute of Technology Goa, Ponda, Goa, India

Keywords: Scene Classification, Dynamic Kernel, Set of Varying Length Feature Map, Support Vector Machine, Convolutional Neural Network, Deep Spatial Pyramid Match Kernel.

Abstract: Several works have shown that Convolutional Neural Networks (CNNs) can be easily adapted to different datasets and tasks. However, extracting deep features from these pre-trained CNNs requires a fixed-size (e.g., 227 × 227) input image. State-of-the-art datasets such as MIT-67 and SUN-397, on the other hand, contain images of different sizes, so using these CNNs forces the user to reduce or enlarge every image to a fixed size. An obvious question arises: "Isn't the conversion to a fixed-size image lossy?" In this work, we avoid this lossy conversion and instead process each image in its original form to obtain a set of varying-size deep feature maps, hence remaining lossless. We also propose the deep spatial pyramid match kernel (DSPMK), which amalgamates sets of varying-size deep feature maps and computes a matching score between two samples. The proposed DSPMK acts as a dynamic kernel in a support vector machine (SVM) framework for scene classification. We demonstrate the effectiveness of combining varying-size CNN-based sets of deep feature maps with the proposed dynamic kernel by achieving state-of-the-art results on high-level visual recognition tasks such as scene classification on the standard MIT-67 and SUN-397 datasets.
1 INTRODUCTION

CNNs have become popular owing to their applicability to a wide range of tasks, such as object recognition (Simonyan and Zisserman, 2014), (Girshick et al., 2014), (Chatfield et al., 2014), image segmentation (Kang and Wang, 2014), image retrieval (Zhao et al., 2015), scene classification (He et al., 2015), (Yoo et al., 2014), and so on. The spectacular results on these state-of-the-art tasks are mainly due to the powerful feature representations learnt by CNNs. Scene image classification, a basic and important problem in computer vision, has received a high degree of attention from researchers. Two important issues in scene image classification are intra-class variability, i.e., images of a particular class differ greatly in their visual appearance, and inter-class similarity, i.e., images of different classes are easily confused because they are composed of similar concepts.

To address these issues, many deep CNNs such as AlexNet (Krizhevsky et al., 2012), GoogLeNet (Szegedy et al., 2015) and VGGNet-16 (Simonyan and Zisserman, 2014) have already been trained on datasets like Places-205, Places-365 (Zhou et al., 2017) and ImageNet (Deng et al., 2009) for image classification tasks. These deep CNNs can be adapted and retrained for other datasets and tasks with a little fine-tuning. In all such cases, features obtained from the pre-trained or fine-tuned CNNs are used to build a fully connected neural network or an SVM-based classifier. These CNNs have also become popular because they provide base architectures and features for many tasks other than the one they were trained for. For example, AlexNet (Krizhevsky et al., 2012) is trained for object recognition, yet (Mandar et al., 2015) used its features for scene classification by further enhancing them through Fisher encoding. These CNNs require a fixed-size input image; for example, AlexNet accepts images of size 227 × 227.
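The fixed-size constraint comes from the fully connected layers, not the convolutional ones: a conv/pool stack accepts any input size but then produces a feature map whose spatial size depends on the input, and only a fixed input (e.g., 227 × 227 for AlexNet) yields the spatial size the fully connected layers expect. A minimal sketch of this size arithmetic, tracing an n × n input through the standard AlexNet conv/pool hyper-parameters (the function names are ours, for illustration):

```python
import math

def conv_out_size(n_in, kernel, stride, pad):
    """Spatial size after one conv/pool layer: floor((n + 2p - k)/s) + 1."""
    return math.floor((n_in + 2 * pad - kernel) / stride) + 1

def alexnet_conv5_size(n):
    """Trace the spatial size of an n x n input through AlexNet's
    conv/pool stack (standard hyper-parameters; channels ignored)."""
    n = conv_out_size(n, 11, 4, 0)  # conv1: 11x11, stride 4
    n = conv_out_size(n, 3, 2, 0)   # pool1: 3x3, stride 2
    n = conv_out_size(n, 5, 1, 2)   # conv2: 5x5, pad 2
    n = conv_out_size(n, 3, 2, 0)   # pool2: 3x3, stride 2
    n = conv_out_size(n, 3, 1, 1)   # conv3: 3x3, pad 1
    n = conv_out_size(n, 3, 1, 1)   # conv4: 3x3, pad 1
    n = conv_out_size(n, 3, 1, 1)   # conv5: 3x3, pad 1
    return n

# A 227 x 227 input gives the 13 x 13 conv5 map the fully connected
# layers were trained on; any other input size gives a different map size.
print(alexnet_conv5_size(227))  # -> 13
print(alexnet_conv5_size(451))  # -> 27
```

This is exactly why keeping images at their original sizes yields a *set of varying-size deep feature maps* across a dataset, rather than one fixed-length vector per image.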
However, state-of-the-art datasets such as the SUN-397 (Xiao et al., 2010) and MIT-67 indoor (Quattoni and Torralba, 2009) scene datasets comprise varying-size images that are much larger than 227 × 227. The conventional approach to using these CNNs is to resize the arbitrary-sized images to a fixed size, which loses image information before the image is even fed to the CNN for feature extraction. The performance of classification

Gupta, S., Pradhan, D., Dinesh, D. and Thenkanidiyoor, V.
Deep Spatial Pyramid Match Kernel for Scene Classification.
DOI: 10.5220/0006596101410148
In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 141-148
ISBN: 978-989-758-276-9
Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
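How a kernel can compare two feature maps of different spatial sizes can be pictured along the lines of classical spatial pyramid matching: pool each map over progressively finer grids to get fixed-length vectors per level, then combine weighted level-wise similarities. The sketch below is our own illustrative re-implementation of that general idea; the pooling grids, histogram-intersection score and level weights here are assumptions, not the paper's DSPMK definition:

```python
import numpy as np

def pyramid_pool(fmap, levels=(1, 2, 4)):
    """Pool a C x H x W feature map into one fixed-length vector per
    pyramid level by average-pooling over an l x l grid. The output
    lengths depend only on C and the levels, not on H and W."""
    c, h, w = fmap.shape
    pooled = []
    for l in levels:
        grid = np.empty((l, l, c))
        hs = np.linspace(0, h, l + 1).astype(int)  # row cell borders
        ws = np.linspace(0, w, l + 1).astype(int)  # column cell borders
        for i in range(l):
            for j in range(l):
                cell = fmap[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                grid[i, j] = cell.mean(axis=(1, 2))
        pooled.append(grid.ravel())
    return pooled

def spm_kernel(fa, fb, levels=(1, 2, 4)):
    """Spatial-pyramid match between two feature maps of (possibly
    different) spatial sizes: histogram intersection per level,
    with finer levels weighted more heavily (illustrative weights)."""
    pa, pb = pyramid_pool(fa, levels), pyramid_pool(fb, levels)
    num_levels = len(levels)
    score = 0.0
    for idx, (va, vb) in enumerate(zip(pa, pb)):
        weight = 1.0 / 2 ** (num_levels - 1 - idx)
        score += weight * np.minimum(va, vb).sum()
    return score
```

Because the per-level pooled vectors have a fixed length regardless of the input map's spatial size, such a score is well defined for any pair of images and can serve as a precomputed kernel matrix for an SVM.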