CountNet: End to End Deep Learning for Crowd
Counting
Bryan Wilie
Bandung Institute of Technology
Bandung, Indonesia
brywilie25@gmail.com
Samuel Cahyawijaya
Bandung Institute of Technology
Bandung, Indonesia
samuel.cahyawijaya@gmail.com
Widyawardana Adiprawita
Bandung Institute of Technology
Bandung, Indonesia
wadiprawita@stei.itb.ac.id
Abstract—We approach the crowd counting problem as a
complex end-to-end deep learning process that requires both
correct recognition and counting. This paper redefines the
crowd counting process as a counting process, rather than
just the recognition process it was previously treated as. An
Xception network is used in CountNet and topped with fully
connected layers. The pre-trained Xception parameters are
used for transfer learning and are trained again together with
the fully connected layers. CountNet then achieves better
crowd counting performance by being trained on an augmented
dataset that makes it robust to scale and slice variations.
Keywords—transfer learning, crowd counting, deep learning
I. INTRODUCTION
The crowd counting task in the deep learning community
aims to count every human head present in a crowd shown in
a photograph. The crowd in a photo usually appears at
different densities, hence the name crowd counting covers
both dense and sparse crowds. Crowd counting is fundamentally
a counting problem, solved by estimating the number of people
in the crowd with regard to the distribution of crowd density
over a gathering area.
One unique aspect of this deep learning task is that not
only can the whole photograph serve as training data, but
slices of the photograph can also stand in for whole
photographs: no-crowd slices, sparse-crowd slices, dense-crowd
slices, and mixtures of them all. This is an advantage for the
data collection process. While other deep learning tasks may
require researchers to acquire hundreds of thousands of
samples, 50 high-resolution photographs can already provide
the same abundance; from them, millions of training samples
can be generated with a few augmentation processes.
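The slicing-based augmentation described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline; the patch size and stride are assumptions chosen for the example:

```python
import numpy as np

def slice_patches(image, patch_h, patch_w, stride):
    """Cut overlapping patches from one photo; each patch becomes a
    training sample in its own right (no-crowd, sparse, or dense)."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_h + 1, stride):
        for x in range(0, w - patch_w + 1, stride):
            patches.append(image[y:y + patch_h, x:x + patch_w])
    return patches

# A single 1024x768 photo, sliced into 256x256 patches with a
# 64-pixel stride, already yields over a hundred samples; flips
# and rescales multiply that number further.
photo = np.zeros((768, 1024, 3), dtype=np.uint8)
patches = slice_patches(photo, 256, 256, 64)
flipped = [np.fliplr(p) for p in patches]  # simple extra augmentation
```

Applied to 50 high-resolution photographs with several scales, strides, and flips, this kind of slicing is how a small photo collection expands into millions of training samples.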
Besides that, crowd counting has been approached from
several perspectives. Earlier work includes detection-based
[12, 13, 20, 21, 22, 24] and regression-based [2, 3, 4, 5, 8,
23] networks. Detection-based crowd counting is successful
for scenes with low crowd density, but its performance on
very dense crowds remains problematic. In such dense
environments, usually only part of each person is visible:
head to shoulder in horizontally taken photos, and only the
top of the head in orthogonally taken photos. The parts to be
detected are then too small, and the detector will not count
any object it does not recognize as part of a crowd. This is
why detection-based methods tend to underestimate counts in
dense crowd settings, which remains a challenge for the
detection approach.
While counting by detection requires a large part of a
human body to be located, crowd counting by regression simply
estimates crowd counts without knowing the location of each
person. Density estimation is sometimes used as an
intermediate result; a linear operation, e.g. a sum, then
yields the overall crowd count [2, 3]. The regression part in
[5], for example, uses a fully convolutional model for
counting in highly congested scenes. In contrast to
detection-based counting, regression-based counting tends to
overestimate counts in sparse crowd settings. This happens
because the regression method tries to fit an n-dimensional
polynomial function capturing the linear and non-linear
relationships between the pixels and the count contributed by
each pixel. The performance of this method relies on the
statistical stability of the pixel data; thus, a regression
method needs to exploit the intrinsic statistical principles
of the whole dataset.
Density estimation itself works well for regression-based
crowd counting if the intermediate output is processed again
by a human operator or by an optimized hand-engineered
feature mapping. As described before, one such feature
mapping is a linear operation: summing the pixels to obtain
the crowd count. This approach works smoothly only if the
density-making process can be inverted without loss, i.e. if
each pixel's translation into density has an inverse, or if
the total pixel value of the blur-filtered area is retained
after filtering, so that the density-making process does not
change the ground-truth crowd count. Crowd counting is a task
with a rich variety of low- and high-level features; it has
many non-linearities not only in its inputs but also in its
outputs. It is not a simple counting task; it is a task of
generalizing the massive non-linearity introduced by
differences in crowd density.
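One common way to make the density-making process count-preserving, as required above, is to give each annotated head a unit-mass Gaussian, so that summing the density map recovers the ground-truth count exactly. The kernel size and sigma below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """2-D Gaussian kernel normalized so its values sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def density_map(shape, head_points, size=15, sigma=4.0):
    """Each head contributes exactly unit mass, so the map's total
    sum equals the annotated head count; kernels clipped at the
    image border are renormalized so no mass is lost."""
    dmap = np.zeros(shape, dtype=np.float64)
    k = gaussian_kernel(size, sigma)
    r = size // 2
    for (y, x) in head_points:
        y0, y1 = max(0, y - r), min(shape[0], y + r + 1)
        x0, x1 = max(0, x - r), min(shape[1], x + r + 1)
        patch = k[r - (y - y0): r + (y1 - y), r - (x - x0): r + (x1 - x)]
        dmap[y0:y1, x0:x1] += patch / patch.sum()
    return dmap

heads = [(30, 40), (31, 42), (90, 10)]   # annotated head positions
dmap = density_map((120, 120), heads)
# dmap.sum() recovers the annotated count of 3.
```

Without the border renormalization, heads near the image edge would lose part of their Gaussian mass, and the summed density would underestimate the ground-truth count, which is exactly the kind of counting-method error the text argues against.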
This research approaches the crowd counting task as an end-
to-end deep learning process, which differs in part from some
previous crowd counter implementations. Some implementations
apply the deep learning algorithm only until it produces a
predicted density map (hence the name density estimation) and
then sum the predicted density map to obtain the predicted
crowd count. In those terms, the algorithm's performance is
limited by the chosen counting method; the end-to-end deep
learning process removes that limit so that the machine can
also learn a better counting method. The limitation is
illustrated in Fig. 1, and the end-to-end solution is
illustrated in Fig. 2.
Fig. 1. Previous implementations, introducing errors e1, e2, and e3
Fig. 2. End-to-end deep learning implementation.
As we can see in Fig. 1, the previous implementations
introduce three kinds of errors, e1, e2, and e3, along the
pipeline from input to output: e1 in the density prediction
stage, e2 in the density output itself, and e3 in the step
that turns the density into a count. In the end-to-end deep
learning method of Fig. 2, e1, e2, and e3 are absorbed into a
single learning process from input to output.
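Under the architecture described in the abstract (an Xception backbone, reused via transfer learning, topped with fully connected layers that regress the count end to end), a minimal Keras sketch might look like the following. The layer widths, pooling choice, and loss are assumptions for illustration, not the paper's exact configuration:

```python
import tensorflow as tf

# Sketch of a CountNet-style model: Xception backbone plus fully
# connected layers that output a single scalar crowd count.
# weights=None avoids a download in this sketch; the transfer-learning
# setup described in the paper would use weights='imagenet' and then
# train the backbone again together with the new head.
backbone = tf.keras.applications.Xception(
    weights=None, include_top=False, pooling='avg',
    input_shape=(299, 299, 3))

count_net = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(1, activation='relu'),  # non-negative count
])
# Regressing the count directly lets the loss flow end to end,
# rather than stopping at an intermediate density map.
count_net.compile(optimizer='adam', loss='mse')
```

Because the final dense layer is trained on the count itself, the network is free to learn its own counting function instead of being bound to a fixed sum over a predicted density map.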
Proceeding of EECSI 2018, Malang - Indonesia, 16-18 Oct 2018
978-1-5386-8402-3/18/$31.00 ©2018 IEEE 128