CountNet: End-to-End Deep Learning for Crowd Counting

Bryan Wilie, Bandung Institute of Technology, Bandung, Indonesia, brywilie25@gmail.com
Samuel Cahyawijaya, Bandung Institute of Technology, Bandung, Indonesia, samuel.cahyawijaya@gmail.com
Widyawardana Adiprawita, Bandung Institute of Technology, Bandung, Indonesia, wadiprawita@stei.itb.ac.id

Abstract—We approach the crowd counting problem as an end-to-end deep learning process that requires both correct recognition and correct counting. This paper redefines crowd counting as a counting process, rather than merely the recognition process it was previously framed as. CountNet uses an Xception network extended with fully connected layers; the pre-trained Xception parameters serve as transfer learning and are trained again together with the fully connected layers. CountNet then achieves better crowd counting performance by being trained on an augmented dataset that makes it robust to scale and slice variations.

Keywords—transfer learning, crowd counting, deep learning

I. INTRODUCTION

The crowd counting task in the deep learning community aims to count every human head present in a crowd shown in a photograph. The crowd in a photo usually appears at varying densities, hence the distinction between dense and sparse crowds. Crowd counting is fundamentally a counting problem: estimating the number of people in the crowd with respect to the distribution of crowd density over a gathering area. One distinctive property of this task is that not only can a whole photograph serve as training data, but slices of the photograph can also each stand in for a whole photograph, as no-crowd slices, sparse-crowd slices, dense-crowd slices, and mixtures of them all. This is an advantage for data collection: whereas other deep learning tasks may require hundreds of thousands of samples, 50 high-resolution photographs can already provide a comparable abundance.
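The slicing idea above can be sketched as follows. This is a minimal illustration rather than the paper's exact pipeline; the slice size and stride are assumed values.

```python
import numpy as np

def slice_image(img, slice_size=224, stride=112):
    """Cut overlapping square slices out of one photo.

    Each slice acts as an independent training sample (no-crowd,
    sparse-crowd, or dense-crowd), multiplying the dataset size.
    """
    h, w = img.shape[:2]
    slices = []
    for top in range(0, h - slice_size + 1, stride):
        for left in range(0, w - slice_size + 1, stride):
            slices.append(img[top:top + slice_size,
                              left:left + slice_size])
    return slices
```

With a stride of half the slice size, a single high-resolution photograph yields hundreds of overlapping samples, which is how a few dozen photos can stand in for a much larger dataset.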
From this abundance, millions of training samples can be obtained with a few augmentation processes. Beyond that, crowd counting has been approached from several perspectives. Earlier work falls into detection-based [12, 13, 20, 21, 22, 24] and regression-based [2, 3, 4, 5, 8, 23] networks. Detection-based crowd counting succeeds in scenes with low crowd density, but its performance on very dense crowds remains problematic. In such dense environments, usually only part of each person is visible: head to shoulder in horizontally taken photos, and only the top of the head in orthogonally taken photos. The parts to be detected become too small, and the detector will not count any object it fails to recognize as part of the crowd. This is why detection tends to underestimate counts in dense crowd settings, which remains a challenge for the method. While counting by detection needs a large part of a human body to be located, counting by regression simply estimates the crowd count without knowing the location of each person. Density estimation is sometimes used as an intermediate result, after which a linear operation, e.g. a sum, yields the overall crowd count [2, 3]. The regression part in [5], for example, uses a fully convolutional CNN model for counting in highly congested scenes. In contrast to detection-based counting, regression-based counting tends to overestimate in sparse crowd settings. This happens because the regression method tries to fit an n-dimensional polynomial function of linear and non-linear relationships between the pixels and the count contributed by each pixel; its performance relies on the statistical stability of the pixel data. Thus, the regression method needs to exploit the intrinsic statistical principles of the whole dataset.
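The density-map intermediate described above, and the linear "sum" operation that turns it into a count [2, 3], can be sketched as follows. This is an illustrative reconstruction, assuming a normalized Gaussian blur over unit impulses at annotated head positions; scipy's `gaussian_filter` is used here for the blurring.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(head_points, shape, sigma=4.0):
    """Build a ground-truth density map from head annotations.

    A unit impulse is placed at every annotated head position and
    blurred with a normalized Gaussian, so the map's total mass
    still equals the number of heads.
    """
    dmap = np.zeros(shape, dtype=np.float64)
    for y, x in head_points:
        dmap[y, x] = 1.0
    return gaussian_filter(dmap, sigma=sigma)

# counting is then a linear operation over the predicted map:
# predicted_count = predicted_density_map.sum()
```

Because the Gaussian kernel sums to one, summing the map recovers the ground-truth count, which is the condition for the sum to be a valid counting method.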
The density estimation method itself serves regression-based crowd counting well if the intermediate output is post-processed by a human or by an optimized hand-engineered feature mapping. As described above, one such feature mapping is a linear operation: summing all pixels to obtain the crowd count. This approach works smoothly only if the density-map generation can be inverted without loss, i.e. if the total pixel value of the blur-filtered area is retained after filtering, so that generating the density map does not change the ground-truth crowd count. Crowd counting is a task with a rich variety of low- and high-level features, and it has many non-linearities not only in its inputs but also in its outputs. It is not a simple counting task; it is a task of generalizing the massive non-linearity introduced by differences in crowd density. This research approaches the crowd counting task as an end-to-end deep learning process, which differs in part from some previous implementations of crowd counters. Some implementations apply the deep learning algorithm only up to the predicted density map (hence the name density estimation) and then sum the predicted density map to obtain the predicted crowd count. Under that scheme, the algorithm's performance is limited by the chosen counting method; the end-to-end deep learning process removes that limit, so the machine can also learn a better counting method. The limitation is illustrated in Fig. 1, and the end-to-end solution is illustrated in Fig. 2.

Fig. 1. Previous implementations, introducing errors e1, e2, and e3.
Fig. 2. End-to-end deep learning implementation.
As we can see in Fig. 1, the previous implementations introduce three kinds of errors, e1, e2, and e3, from the

Proceeding of EECSI 2018, Malang - Indonesia, 16-18 Oct 2018. 978-1-5386-8402-3/18/$31.00 ©2018 IEEE
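As a concrete sketch of the end-to-end idea, the CountNet described in the abstract (a pre-trained Xception backbone extended with fully connected layers that regress the count directly) could be built along these lines in Keras. The fully connected layer widths here are assumptions for illustration, not the paper's stated configuration.

```python
import tensorflow as tf

def build_countnet(input_shape=(299, 299, 3), weights="imagenet"):
    # Xception backbone; ImageNet weights give the transfer-learning start
    base = tf.keras.applications.Xception(
        include_top=False, weights=weights,
        input_shape=input_shape, pooling="avg")
    # fully connected head regresses the crowd count directly, so the
    # counting step itself is learned end to end (no density-map sum)
    x = tf.keras.layers.Dense(256, activation="relu")(base.output)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    count = tf.keras.layers.Dense(1, activation="relu")(x)
    return tf.keras.Model(base.input, count)
```

Training backbone and head together on photo slices with their ground-truth counts avoids fixing a hand-picked counting operation, which is the limitation illustrated in Fig. 1.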