Deep Spatial Pyramid Match Kernel for Scene Classification
Shikha Gupta¹, Deepak Kumar Pradhan², Dileep Aroor Dinesh¹ and Veena Thenkanidiyoor²
¹School of Computing and EE, Indian Institute of Technology, Mandi, H.P., India
²Department of CSE, National Institute of Technology Goa, Ponda, Goa, India
Keywords: Scene Classification, Dynamic Kernel, Set of Varying Length Feature Map, Support Vector Machine,
Convolutional Neural Network, Deep Spatial Pyramid Match Kernel.
Abstract: Several works have shown that Convolutional Neural Networks (CNNs) can be easily adapted to different datasets and tasks. However, extracting deep features from these pre-trained CNNs requires a fixed-size input image (e.g., 227 × 227), while state-of-the-art datasets such as MIT-67 and SUN-397 contain images of varying sizes. Using CNNs on these datasets therefore forces the user to bring differently sized images to a fixed size, either by shrinking or enlarging them, which raises an obvious question: isn't the conversion to a fixed-size image lossy? In this work, we provide a mechanism that avoids such lossy fixed-size images and processes each image in its original form to obtain a set of varying-size deep feature maps, and is hence lossless. We also propose the deep spatial pyramid match kernel (DSPMK), which amalgamates sets of varying-size deep feature maps and computes a matching score between samples. The proposed DSPMK acts as a dynamic kernel in a support-vector-machine-based scene classification framework. We demonstrate the effectiveness of combining varying-size CNN-based sets of deep feature maps with a dynamic kernel by achieving state-of-the-art results for high-level visual recognition tasks such as scene classification on the standard MIT-67 and SUN-397 datasets.
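The matching idea sketched in the abstract builds on classic spatial pyramid matching, where two images are compared by histogram intersection over progressively finer spatial grids. As a minimal illustration of how two feature maps of *different* spatial sizes can still be compared, the snippet below pools each map onto fixed grids at several pyramid levels and sums the histogram-intersection scores. This is only a hedged sketch of the general technique, not the paper's DSPMK formulation (defined later in the paper); the choice of average pooling, the grid sizes, and the function names are assumptions made for illustration.

```python
import numpy as np

def pool_to_grid(fmap, g):
    """Average-pool a C x H x W feature map onto a g x g spatial grid and
    return the L1-normalised flattened vector, so that maps of different
    spatial sizes become comparable. Assumes H >= g and W >= g."""
    C, H, W = fmap.shape
    out = np.zeros((C, g, g))
    hs = np.linspace(0, H, g + 1).astype(int)
    ws = np.linspace(0, W, g + 1).astype(int)
    for i in range(g):
        for j in range(g):
            out[:, i, j] = fmap[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]].mean(axis=(1, 2))
    v = out.ravel()
    return v / (v.sum() + 1e-12)

def pyramid_match(fa, fb, levels=(1, 2, 4)):
    """Sum of histogram-intersection scores over the pyramid levels.
    With L1-normalised pooled vectors, each level contributes at most 1,
    and identical maps score exactly len(levels)."""
    return sum(np.minimum(pool_to_grid(fa, g), pool_to_grid(fb, g)).sum()
               for g in levels)
```

For example, a 13 × 13 conv5 map (from a 227 × 227 input) and a 9 × 11 map (from a larger, unresized input) can be matched directly, since each is pooled onto the same 1 × 1, 2 × 2 and 4 × 4 grids before comparison.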
1 INTRODUCTION
CNNs have become popular for their applicability to a wide range of tasks, such as object recognition (Simonyan and Zisserman, 2014), (Girshick et al., 2014), (Chatfield et al., 2014), image segmentation (Kang and Wang, 2014), image retrieval (Zhao et al., 2015), and scene classification (He et al., 2015), (Yoo et al., 2014). Spectacular results on state-of-the-art tasks are mainly due to the powerful feature representations learnt by CNNs. Scene image classification, being a basic and important aspect of computer vision, has received a high degree of attention among researchers. Two important issues in scene image classification are intra-class variability, i.e., images of a particular class differ greatly in their visual appearance, and inter-class similarity, i.e., images of different classes are easily confused because they are composed of similar concepts. To address these issues, many deep CNNs such as AlexNet (Krizhevsky
et al., 2012), GoogLeNet (Szegedy et al., 2015)
and VGGNet-16 (Simonyan and Zisserman, 2014)
have already been trained on datasets like Places-205,
Places-365 (Zhou et al., 2017) and ImageNet (Deng
et al., 2009) for image classification tasks. These deep CNNs can be adapted and retrained for other datasets and tasks with little fine-tuning. In all such cases, features obtained from the pre-trained or fine-tuned CNNs are used to build fully connected neural network or SVM-based classifiers. These CNNs have also become popular to a great extent because they provide base architectures and features for many tasks similar to, yet different from, the one for which they were trained. For example, AlexNet (Krizhevsky et al., 2012) is trained for object recognition, but (Mandar et al., 2015) used its features for scene classification by further enhancing them through Fisher encoding. These CNNs require input images of a fixed size; for example, AlexNet accepts images of size 227 × 227. However, state-of-the-art datasets such as the SUN-397 (Xiao et al., 2010) and MIT-67 indoor (Quattoni and Torralba, 2009) scene datasets comprise images of varying sizes, often much larger than 227 × 227. The conventional approach to using these CNNs is to resize the arbitrary-sized images to a fixed size, which causes a loss of information before the image is fed to the CNN for feature extraction. The performance of classification
Gupta, S., Pradhan, D., Dinesh, D. and Thenkanidiyoor, V.
Deep Spatial Pyramid Match Kernel for Scene Classification.
DOI: 10.5220/0006596101410148
In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 141-148
ISBN: 978-989-758-276-9
Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved