IDEnet: Inception-Based Deep Convolutional Neural Network for Crowd Counting Estimation

Samuel Cahyawijaya, Institut Teknologi Bandung, Bandung, Indonesia, samuel.cahyawijaya@gmail.com
Bryan Wilie, Institut Teknologi Bandung, Bandung, Indonesia, brywilie25@gmail.com
Widyawardana Adiprawita, Institut Teknologi Bandung, Bandung, Indonesia, wadiprawita@stei.itb.ac.id

Abstract— In the crowd counting task, the goal is to estimate the density map and the count of people from a given crowd image. From our analysis, two major problems need to be solved in crowd counting: the scale invariance problem and the inhomogeneous density problem. Many methods have been developed to tackle these problems, for example by designing dense-aware or scale-adaptive models. Our approach is motivated by both problems, and we propose a dense-aware, inception-based neural network to tackle them. We introduce a novel inception-based crowd counting model called Inception Dense Estimator network (IDEnet). IDEnet is divided into two modules: the Inception Dense Block (IDB) and the Dense Evaluator Unit (DEU). Several variations of IDEnet are evaluated and analysed to find the best model. We evaluate our best model on the UCF50 and ShanghaiTech datasets. IDEnet outperforms the current state-of-the-art method on the ShanghaiTech part B dataset. We close with 6 key conclusions based on our experiments and error analysis.

Keywords—crowd counting, inception network, convolutional neural network, deep learning, dense aware, scale adaptive

I. INTRODUCTION

Crowd counting is the task of counting a large number of specified objects in a given image. When the number of objects is small, a detection-based approach is typically used. Such methods work well on most low-density (sparse) images but usually fail on high-density (crowd) images [9][11][14].
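As an illustration of the density-map formulation mentioned in the abstract, the following minimal Python sketch builds a density map from point annotations and recovers the count by summing it. It is not the implementation used in this paper: the fixed Gaussian kernel, the function names (`gaussian_kernel`, `density_map`, `count_from_density`), and the sample head positions are all assumptions made for illustration.

```python
import math

def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    half = size // 2
    kernel = [[math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]

def density_map(height, width, heads, size=5, sigma=1.0):
    """Place one unit-mass Gaussian per annotated head position (row, col)."""
    dmap = [[0.0] * width for _ in range(height)]
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    for row, col in heads:
        # Keep only the kernel cells that fall inside the image, then
        # renormalize so a head near the border still contributes exactly 1.
        cells = []
        for dy in range(size):
            for dx in range(size):
                r, c = row + dy - half, col + dx - half
                if 0 <= r < height and 0 <= c < width:
                    cells.append((r, c, kernel[dy][dx]))
        mass = sum(w for _, _, w in cells)
        for r, c, w in cells:
            dmap[r][c] += w / mass
    return dmap

def count_from_density(dmap):
    """The estimated count is the sum (discrete integral) of the density map."""
    return sum(sum(row) for row in dmap)

# Hypothetical annotations: three heads in a 16 x 16 image.
heads = [(2, 3), (10, 10), (0, 0)]
dmap = density_map(16, 16, heads)
estimated_count = count_from_density(dmap)
```

Because every head contributes a kernel of total mass 1, summing the map recovers the number of annotated heads. A counting model trained to regress such a map therefore yields a count estimate as a by-product.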
In the crowd counting task, a model is trained to fit the density map of the image and to output the total predicted count for that image. From our analysis, two major problems need to be tackled to obtain a better counting estimate. The first is the scale invariance problem, which is caused by the varying scale of objects within a given image. The second is the inhomogeneous density problem, which is caused by the differing density levels across crowd images. Both problems make it difficult to choose the right filter size for each image region.

To handle these problems, we borrow the idea of the inception network. An inception network is built from several kinds of repeatable inception modules. Each inception module lets the network learn the best filter to use by providing multiple paths through the computational graph. The inception network was first introduced in 2014 by Szegedy, C. et al. [1], and several improved versions have followed: Inception-v1 [1]; BN Inception [2]; Inception-v2 and Inception-v3 [3]; and Inception-v4, Inception Resnet-v1, and Inception Resnet-v2 [4]. Inception networks have been evaluated on the ILSVRC dataset, where they achieve high accuracy.

In this paper, we introduce a novel approach based on Inception Network v1 called Inception Dense Estimator Network (IDEnet). This work makes 4 main contributions. First, in section III, we present a novel methodology for applying the Inception Network idea to the counting task, in particular the modifications needed to handle crowd counting. Second, in subsection IV.B, we report the results obtained with several alternative IDEnet architectures. Third, in subsection IV.C, we evaluate our final proposed model on two publicly available crowd counting datasets (UCF50 and ShanghaiTech) and benchmark our results against other methods.
Fourth, in subsection IV.D, we conduct a manual error analysis to better understand the counting estimation error.

II. RELATED WORKS

Work on the crowd counting task can be divided into 2 families of methods: detection-based and regression-based. We focus our study on regression-based methods because detection-based methods tend to suffer severely in crowds with a high occlusion level [9][11][14].

Several regression analysis approaches have been applied to crowd counting. A texture analysis with edge and foreground detection was conducted in 2005 [5]. A Bayesian Poisson regression technique [6] was evaluated on sparse crowd images in 2009. A multiple texture analysis approach [7] combines several techniques: head detection, Fourier analysis, wavelet transform, interest point analysis, and GLCM.

Several regression-based methods are designed to be scale adaptive. To achieve this, most works implement a multi-column network [8][9][10]. In a multi-column network, the input image is passed to several different subnetworks, each with a different architecture and different hyperparameters. The outputs of the subnetworks are then merged to estimate the count. In another work, named Switching CNN [11], the network is divided into a switch module and 3 different counting modules. The switch chooses which regressor should be used for the given input image.

Other regression-based methods exploit spatiotemporal features, using a sequence of images to improve counting quality. Xiong, F. et al. [12] use a convolutional LSTM layer to process a sequence of images into the estimated density map. Liu, W. et al. [13] process a sequence of images with a Siamese network approach in which each subnetwork extracts spatial features from the image at time T, which are then combined with temporal constraints.
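The multi-column idea described above can be illustrated with a toy Python sketch. It is not taken from any of the cited works: each "column" here is a single fixed averaging filter with its own kernel size rather than a trained subnetwork, the merge step is a plain weighted average standing in for a learned merge layer, and the names `box_filter` and `multi_column` are hypothetical.

```python
def box_filter(image, k):
    """Same-size convolution with a k x k averaging kernel (zero padding)."""
    h, w = len(image), len(image[0])
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += image[rr][cc]
            out[r][c] = acc / (k * k)
    return out

def multi_column(image, kernel_sizes=(3, 5, 7), merge_weights=None):
    """Feed the same image to every column, then merge the column outputs."""
    # Each column sees the full input but uses a different receptive field.
    columns = [box_filter(image, k) for k in kernel_sizes]
    if merge_weights is None:
        # Assumption: equal weights; a trained network would learn these.
        merge_weights = [1.0 / len(columns)] * len(columns)
    h, w = len(image), len(image[0])
    merged = [[sum(wt * col[r][c] for wt, col in zip(merge_weights, columns))
               for c in range(w)] for r in range(h)]
    return columns, merged

# Toy input: a single unit response at the centre of a 9 x 9 image.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 1.0
columns, merged = multi_column(image)
merged_total = sum(sum(row) for row in merged)
```

The small-kernel column keeps the response concentrated while the large-kernel column spreads it out, which is the intuition behind using different columns for heads of different apparent sizes.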
Another approach, called scale-adaptive CNN [14], develops a scale-adaptive single-column network using pooling, residual, and deconvolution layers. Another work, called Pyramidal CNN [15], estimates global and local context to achieve a better estimate. In Liu et al. [16], a dense rank is generated from the image, and both count and rank are estimated to improve the quality of the model. Another work, called DecideNet [17], uses an approach similar to a multi-column network, but some columns interact with one another by sending their output as one of another column’s inputs.

Proceeding of EECSI 2018, Malang - Indonesia, 16-18 Oct 2018 978-1-5386-8402-3/18/$31.00 ©2018 IEEE
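The inception idea from Section I, in which parallel paths with different filter sizes let the network choose among them, can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the IDEnet architecture: the kernels are fixed averaging filters instead of learned weights, the 1x1 bottleneck convolutions of Inception-v1 are omitted, and all function names are hypothetical.

```python
def conv_same(image, kernel):
    """2D cross-correlation with zero padding; output has the input's size."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - half, c + j - half
                    if 0 <= rr < h and 0 <= cc < w:
                        out[r][c] += kernel[i][j] * image[rr][cc]
    return out

def max_pool_same(image, k=3):
    """k x k max pooling with stride 1; out-of-bounds cells are ignored."""
    h, w = len(image), len(image[0])
    half = k // 2
    return [[max(image[rr][cc]
                 for rr in range(max(0, r - half), min(h, r + half + 1))
                 for cc in range(max(0, c - half), min(w, c + half + 1)))
             for c in range(w)] for r in range(h)]

def mean_kernel(k):
    """Fixed k x k averaging kernel (stand-in for learned weights)."""
    return [[1.0 / (k * k)] * k for _ in range(k)]

def inception_module(image):
    """Run four parallel paths and concatenate them along the channel axis."""
    return [
        conv_same(image, mean_kernel(1)),  # 1x1 path
        conv_same(image, mean_kernel(3)),  # 3x3 path
        conv_same(image, mean_kernel(5)),  # 5x5 path
        max_pool_same(image, 3),           # pooling path
    ]

def mix_channels(channels, weights):
    """1x1 convolution across channels: a per-pixel weighted sum."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[sum(wt * ch[r][c] for wt, ch in zip(weights, channels))
             for c in range(w)] for r in range(h)]

# Toy input: a single unit response at the centre of a 7 x 7 image.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0
channels = inception_module(image)
mixed = mix_channels(channels, [0.25, 0.25, 0.25, 0.25])
```

In a trained network the weights passed to the channel-mixing step are learned, so the layer that follows the module can emphasise whichever path suits a given image region.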