Deep Spatial Pyramid Match Kernel for Scene Classification
Shikha Gupta¹, Deepak Kumar Pradhan², Dileep Aroor Dinesh¹ and Veena Thenkanidiyoor²
¹School of Computing and EE, Indian Institute of Technology, Mandi, H.P., India
²Department of CSE, National Institute of Technology Goa, Ponda, Goa, India
Keywords: Scene Classification, Dynamic Kernel, Set of Varying Length Feature Map, Support Vector Machine,
Convolutional Neural Network, Deep Spatial Pyramid Match Kernel.
Abstract: Several works have shown that Convolutional Neural Networks (CNNs) can be easily adapted to different datasets and tasks. However, extracting deep features from these pre-trained CNNs requires a fixed-size input image (e.g., 227 × 227), while state-of-the-art datasets such as MIT-67 and SUN-397 contain images of varying sizes. Using CNNs on these datasets therefore forces the user to bring differently sized images to a fixed size, either by shrinking or enlarging them, which raises an obvious question: isn't the conversion to a fixed-size image lossy? In this work, we provide a mechanism that avoids such lossy fixed-size images and processes each image in its original form to obtain a set of varying-size deep feature maps, and is hence lossless. We also propose the deep spatial pyramid match kernel (DSPMK), which amalgamates sets of varying-size deep feature maps and computes a matching score between samples. The proposed DSPMK acts as a dynamic kernel in a support-vector-machine-based scene classification framework. We demonstrate the effectiveness of combining varying-size CNN-based sets of deep feature maps with a dynamic kernel by achieving state-of-the-art results for high-level visual recognition tasks such as scene classification on the standard MIT-67 and SUN-397 datasets.
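The matching idea sketched in the abstract builds on classic spatial pyramid matching, where two images are compared by histogram intersection over progressively finer spatial grids. As a minimal illustration of how two feature maps of *different* spatial sizes can still be compared, the snippet below pools each map onto fixed grids at several pyramid levels and sums the histogram-intersection scores. This is only a hedged sketch of the general technique, not the paper's DSPMK formulation (defined later in the paper); the choice of average pooling, the grid sizes, and the function names are assumptions made for illustration.

```python
import numpy as np

def pool_to_grid(fmap, g):
    """Average-pool a C x H x W feature map onto a g x g spatial grid and
    return the L1-normalised flattened vector, so that maps of different
    spatial sizes become comparable. Assumes H >= g and W >= g."""
    C, H, W = fmap.shape
    out = np.zeros((C, g, g))
    hs = np.linspace(0, H, g + 1).astype(int)
    ws = np.linspace(0, W, g + 1).astype(int)
    for i in range(g):
        for j in range(g):
            out[:, i, j] = fmap[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]].mean(axis=(1, 2))
    v = out.ravel()
    return v / (v.sum() + 1e-12)

def pyramid_match(fa, fb, levels=(1, 2, 4)):
    """Sum of histogram-intersection scores over the pyramid levels.
    With L1-normalised pooled vectors, each level contributes at most 1,
    and identical maps score exactly len(levels)."""
    return sum(np.minimum(pool_to_grid(fa, g), pool_to_grid(fb, g)).sum()
               for g in levels)
```

For example, a 13 × 13 conv5 map (from a 227 × 227 input) and a 9 × 11 map (from a larger, unresized input) can be matched directly, since each is pooled onto the same 1 × 1, 2 × 2 and 4 × 4 grids before comparison.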
1 INTRODUCTION
CNNs have become popular for their applicability to a wide range of tasks, such as object recognition (Simonyan and Zisserman, 2014), (Girshick et al., 2014), (Chatfield et al., 2014), image segmentation (Kang and Wang, 2014), image retrieval (Zhao et al., 2015), and scene classification (He et al., 2015), (Yoo et al., 2014). Spectacular results on state-of-the-art tasks are mainly due to the powerful feature representations learnt by CNNs. Scene image classification, being a basic and important aspect of computer vision, has received a high degree of attention among researchers. Two important issues in scene image classification are intra-class variability, i.e., images of a particular class differ greatly in their visual appearance, and inter-class similarity, i.e., images of different classes are easily confused because they are composed of similar concepts. To address these issues, many deep CNNs such as AlexNet (Krizhevsky
et al., 2012), GoogLeNet (Szegedy et al., 2015)
and VGGNet-16 (Simonyan and Zisserman, 2014)
have already been trained on datasets like Places-205,
Places-365 (Zhou et al., 2017) and ImageNet (Deng
et al., 2009) for image classification tasks. These deep CNNs can be adapted and retrained for other datasets and tasks with little fine-tuning. In all such cases, features obtained from the pre-trained or fine-tuned CNNs are used to build fully connected neural network or SVM-based classifiers. These CNNs have also become popular to a great extent because they provide base architectures and features for many tasks similar to, yet different from, the one for which they were trained. For example, AlexNet (Krizhevsky et al., 2012) is trained for object recognition, but (Mandar et al., 2015) used its features for scene classification by further enhancing them through Fisher encoding. These CNNs require input images of a fixed size; for example, AlexNet accepts images of size 227 × 227. However, state-of-the-art datasets such as the SUN-397 (Xiao et al., 2010) and MIT-67 indoor (Quattoni and Torralba, 2009) scene datasets comprise images of varying sizes, often much larger than 227 × 227. The conventional approach to using these CNNs is to resize the arbitrary-sized images to a fixed size, which causes a loss of information before the image is fed to the CNN for feature extraction. The performance of classification
Gupta, S., Pradhan, D., Dinesh, D. and Thenkanidiyoor, V.
Deep Spatial Pyramid Match Kernel for Scene Classification.
DOI: 10.5220/0006596101410148
In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 141-148
ISBN: 978-989-758-276-9
Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved