IDEnet: Inception-Based Deep Convolutional
Neural Network for Crowd Counting Estimation
Samuel Cahyawijaya
Institut Teknologi Bandung
Bandung, Indonesia
samuel.cahyawijaya@gmail.com
Bryan Wilie
Institut Teknologi Bandung
Bandung, Indonesia
brywilie25@gmail.com
Widyawardana Adiprawita
Institut Teknologi Bandung
Bandung, Indonesia
wadiprawita@stei.itb.ac.id
Abstract—In the crowd counting task, the goal is to estimate
a density map and the number of people in a given crowd image.
From our analysis, two major problems need to be solved in
crowd counting: the scale invariant problem and the
inhomogeneous density problem. Many methods have been
developed to tackle these problems by designing dense-aware
models, scale-adaptive models, and so on. Our approach is
motivated by both problems, and we propose a dense-aware,
inception-based neural network to tackle them. We introduce
a novel inception-based crowd counting model called the
Inception Dense Estimator network (IDEnet). IDEnet is divided
into two modules: the Inception Dense Block (IDB) and the
Dense Evaluator Unit (DEU). Several variations of IDEnet are
evaluated and analysed to find the best model. We evaluate our
best model on the UCF50 and ShanghaiTech datasets. IDEnet
outperforms the current state-of-the-art methods on the
ShanghaiTech Part B dataset. We conclude our work with six key
conclusions based on our experiments and error analysis.
Keywords—crowd counting, inception network, convolutional
neural network, deep learning, dense aware, scale adaptive
I. INTRODUCTION
Crowd counting is the task of counting a large number of
specified objects in a given image. When the number of objects
is small, a detection-based approach is typically used. Such
methods work well on most low-density (sparse) images but
usually fail on high-density (crowded) images [9][11][14]. In
crowd counting, a model is trained to fit the density map of
the image and to output the total predicted count for that
image. From our analysis, two major problems need to be
tackled to obtain better count estimates. The first is the
scale invariant problem, which is caused by the variation in
object scale within a given image. The second is the
inhomogeneous density problem, which is caused by the
differing density levels across crowd images. Both problems
make it difficult to choose the right filter size for each
image region.
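To make the relation between a density map and a count concrete, the following minimal sketch places one normalized Gaussian at each annotated head position and recovers the count as the sum of the map. The fixed-width Gaussian kernel, the function names (gaussian_kernel, density_map, count) and the toy annotations are assumptions for illustration only; this section does not specify how the ground-truth maps are generated.

```python
import math

def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    half = size // 2
    kernel = [[math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]

def density_map(height, width, heads, size=5, sigma=1.0):
    """Build a density map from (row, col) head annotations (assumed in bounds)."""
    dmap = [[0.0] * width for _ in range(height)]
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    for r, c in heads:
        # Keep only the kernel cells that fall inside the image.
        cells = []
        for dy in range(size):
            for dx in range(size):
                y, x = r + dy - half, c + dx - half
                if 0 <= y < height and 0 <= x < width:
                    cells.append((y, x, kernel[dy][dx]))
        # Renormalize so every head contributes exactly 1, even at the border.
        mass = sum(w for _, _, w in cells)
        for y, x, w in cells:
            dmap[y][x] += w / mass
    return dmap

def count(dmap):
    """The predicted count is the sum (discrete integral) of the density map."""
    return sum(map(sum, dmap))

# Hypothetical annotations: three heads in a 16 x 16 image, one at a corner.
heads = [(2, 3), (10, 10), (0, 0)]
dmap = density_map(16, 16, heads)
print(round(count(dmap), 6))
```

Because each head contributes unit mass, the sum of the map equals the number of annotated heads up to floating-point error, which is why a regression model can be trained on the map and still report a count.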
To handle these problems, we borrow the idea of the inception
network. An inception network is built from several kinds of
repeatable inception modules. Each inception module lets the
network learn the best filter to use by providing multiple
paths through the computational graph. The inception network
was first introduced in 2014 by Szegedy et al. [1], and
several improved versions have followed: Inception-v1 [1];
BN-Inception [2]; Inception-v2 and Inception-v3 [3]; and
Inception-v4, Inception-ResNet-v1, and Inception-ResNet-v2
[4]. Inception networks have been evaluated on the ILSVRC
dataset and achieve high accuracy.
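The multi-path idea can be illustrated with a deliberately simplified sketch: the same input is passed through parallel branches with different filter sizes, and the branch outputs are stacked as channels so that later layers can weight whichever scale is most useful. This is not the IDEnet block or the full Inception-v1 module (which also has a pooling branch and 1x1 reductions); the fixed box filters stand in for learned weights, and the names conv2d_same, box_kernel and inception_like_block are hypothetical.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding ('same' size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - half, x + dx - half
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy][dx]
            out[y][x] = acc
    return out

def box_kernel(k):
    """Fixed k x k averaging filter, a stand-in for learned weights."""
    return [[1.0 / (k * k)] * k for _ in range(k)]

def inception_like_block(image, kernel_sizes=(1, 3, 5)):
    """Run one branch per filter size and stack the outputs as channels."""
    return [conv2d_same(image, box_kernel(k)) for k in kernel_sizes]

# Toy 6 x 6 single-channel input.
image = [[float((x + y) % 4) for x in range(6)] for y in range(6)]
features = inception_like_block(image)
print(len(features), len(features[0]), len(features[0][0]))  # prints "3 6 6"
```

Each branch preserves the spatial size, so the three outputs can be concatenated along the channel axis; the 1x1 branch returns the input unchanged here, while the 3x3 and 5x5 branches aggregate progressively larger neighbourhoods.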
In this paper, we introduce a novel approach based on
Inception-v1 called the Inception Dense Estimator Network
(IDEnet). This work makes four main contributions. First, in
Section III, we present a novel methodology for applying the
inception network idea to the counting task, in particular the
modifications needed to handle crowd counting. Second, in
Subsection IV.B, we report the results obtained with several
alternative IDEnet architectures. Third, in Subsection IV.C,
we evaluate our final proposed model on two publicly available
crowd counting datasets (UCF50 and ShanghaiTech) and benchmark
our results against other methods. Fourth, in Subsection IV.D,
we conduct a manual error analysis to better understand the
counting estimation error.
II. RELATED WORKS
Work on the crowd counting task can be divided into two
families of methods: detection-based and regression-based. We
focus our study on regression-based methods because
detection-based methods tend to suffer severely in crowds with
a high level of occlusion [9][11][14]. Several regression
approaches have been applied to crowd counting. A texture
analysis with edge and foreground detection was conducted in
2005 [5]. A Bayesian Poisson regression technique [6] was
evaluated on sparse crowd images in 2009. A multiple texture
analysis approach [7] combined several texture analysis
techniques: head detection, Fourier and wavelet transforms,
interest point analysis, and the gray-level co-occurrence
matrix (GLCM).
Several regression-based methods are designed to be scale
adaptive. To achieve this, most works implement a multi-column
network [8][9][10]. In a multi-column network, the input image
is fed into several different subnetworks, each with a
different architecture and different hyperparameters. The
outputs of the subnetworks are then merged to estimate the
count. In another work, named Switching CNN [11], the network
is divided into a switch module and three different counting
modules. The switch chooses which regressor should be used for
the given input image.
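The difference between merging all columns and switching between them can be sketched as follows. Everything in this sketch is a hypothetical stand-in: real columns and switches are trained CNNs, whereas here each "column" is a fixed scaling of the patch sum, the merge is a plain average, and the switch is a threshold on mean patch density.

```python
def total(patch):
    """Sum of all values in a 2D patch."""
    return sum(map(sum, patch))

# Three hypothetical "columns", each mapping a patch to a count estimate.
# In real multi-column networks these are CNNs with different filter sizes.
COLUMNS = [
    lambda p: 0.8 * total(p),
    lambda p: 1.0 * total(p),
    lambda p: 1.2 * total(p),
]

def multi_column_count(patch):
    """Multi-column style: run every column and merge (assumed: average)."""
    estimates = [column(patch) for column in COLUMNS]
    return sum(estimates) / len(estimates)

def switch(patch):
    """Hypothetical switch: pick a column index from the mean patch density."""
    mean = total(patch) / (len(patch) * len(patch[0]))
    if mean < 0.1:
        return 0
    if mean < 0.5:
        return 1
    return 2

def switching_count(patch):
    """Switching style: only the column selected by the switch is evaluated."""
    return COLUMNS[switch(patch)](patch)

sparse = [[0.0, 0.0], [0.0, 0.2]]  # mean 0.05, routed to column 0
dense = [[1.0, 1.0], [1.0, 1.0]]   # mean 1.0, routed to column 2

print(switch(sparse), switch(dense))
print(multi_column_count(dense), switching_count(dense))
```

The design trade-off this illustrates is that the multi-column variant always pays for every column and blends their outputs, while the switching variant commits to a single regressor per input.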
Some other regression-based methods exploit spatiotemporal
features by using a sequence of images to improve counting
quality. Xiong, F. et al. [12] use a convolutional LSTM layer
to process a sequence of images into the estimated density
map. Liu, W. et al. [13] process a sequence of images with a
Siamese network approach, in which each subnetwork extracts
spatial features from the image at time T and the features are
then combined under some temporal constraints.
Another approach, called scale-adaptive CNN [14], develops a
scale-adaptive single-column network by using pooling,
residual, and deconvolution layers. Another work, called
Pyramidal CNN [15], estimates global and local context to
achieve better estimation. In Liu et al. [16], a dense rank is
generated from the image, and both count and rank are
estimated to improve the quality of the model. Another work,
called DecideNet [17], uses an approach similar to a
multi-column network, but some columns interact with one
another by sending their output as one of the other column's
inputs.
Proceeding of EECSI 2018, Malang - Indonesia, 16-18 Oct 2018
978-1-5386-8402-3/18/$31.00 ©2018 IEEE 548