CCCNet: An Attention Based Deep Learning Framework for Categorized Counting of Crowd in Different Body States

Sarkar Snigdha Sarathi Das*
Department of CSE
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
sarathismg@gmail.com

Syed Md. Mukit Rashid*
Department of CSE
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
mukitrashid270596@gmail.com

Mohammed Eunus Ali
Department of CSE
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
mohammed.eunus.ali@gmail.com

Abstract—The crowd counting problem, which counts the number of people in an image, has been extensively studied in recent years. In this paper, we introduce a new variant of the crowd counting problem, namely categorized crowd counting, that counts the number of people sitting and standing in a given image. Categorized crowd counting has many real-world applications such as crowd monitoring, customer service, and resource management. The major challenges in categorized crowd counting come from high occlusion, perspective distortion, and the seemingly identical upper body posture of sitting and standing persons. Existing density-map-based approaches perform well at approximating a large crowd, but lose important local information necessary for categorization. On the other hand, traditional detection-based approaches perform poorly in occluded environments, especially when the crowd size gets bigger. Hence, to solve the categorized crowd counting problem, we develop a novel attention-based deep learning framework that addresses the above limitations. In particular, our approach works in three phases: i) we first generate basic detection-based sitting and standing density maps to capture the local information; ii) then, we generate a crowd-counting-based density map as a global counting feature; iii) finally, we have a cross-branch segregating refinement phase that splits the crowd density map into final sitting and standing density maps using an attention mechanism.
Extensive experiments show the efficacy of our approach in solving the categorized crowd counting problem.

Index Terms—Crowd Counting, Convolutional Neural Networks, Attention Mechanism, Human Pose Estimation

* Equal Contribution

I. INTRODUCTION

The crowd counting problem, which counts the number of people in a given image, has gained considerable attention in recent years due to strong demand in video surveillance, public safety, and urban planning. Counting a crowd by automatic scene analysis is a challenging task due to occlusion, complex background, non-uniform distributions of scale and perspective variations. A plethora of techniques have been proposed in recent years (e.g., [1]–[3]) to address these challenges and to increase the accuracy of crowd counts in different real-world environments.

Fig. 1: Example Images From Our Dataset

In this paper, we introduce a new variant of crowd counting, namely categorized crowd counting, that counts the number of persons sitting and standing separately in a given image. There are many practical applications of categorized crowd counting. For example, a bank manager may want to know the number of customers who are standing and waiting inside the service area of the bank so that s/he can increase on-demand resources to serve the customers better; a bus/tram operator may want to know the number of standing and sitting passengers in the bus/tram, which will help them decide on the frequency and size of transports needed at different times of the day; a service provider may want to know the number of standing and sitting customers in a room to decide on the facilities that they should provide. In general, categorized crowd counting adds a new dimension to providing quality services, especially in restaurants, banks, airport waiting areas, subways, and public transport, where delivering quality customer service is crucial.
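As a rough illustration of how categorized counts relate to an overall crowd count, the sketch below splits a crowd density map into sitting and standing density maps using per-pixel attention weights. It is a minimal sketch under simplifying assumptions and should not be read as the framework introduced in this paper: the function names `split_density` and `count`, the softmax weighting over two hint maps, and the toy values are all hypothetical.

```python
import math


def split_density(crowd, sit_hint, stand_hint):
    """Split a crowd density map into sitting and standing maps.

    Hypothetical illustration: a per-pixel softmax over the two hint maps
    yields attention weights that sum to 1, so at every pixel
    sitting + standing equals the crowd density.
    """
    sitting, standing = [], []
    for c_row, s_row, t_row in zip(crowd, sit_hint, stand_hint):
        sit_out, stand_out = [], []
        for c, s, t in zip(c_row, s_row, t_row):
            m = max(s, t)  # subtract the max for numerical stability
            e_sit, e_stand = math.exp(s - m), math.exp(t - m)
            w_sit = e_sit / (e_sit + e_stand)
            sit_out.append(w_sit * c)
            stand_out.append((1.0 - w_sit) * c)
        sitting.append(sit_out)
        standing.append(stand_out)
    return sitting, standing


def count(density):
    """A count is the sum of a density map over all pixels."""
    return sum(sum(row) for row in density)


# Toy 2x2 maps (assumed values, for illustration only).
crowd = [[0.5, 0.5], [1.0, 0.0]]
sit_hint = [[2.0, 0.0], [0.0, 0.0]]
stand_hint = [[0.0, 2.0], [0.0, 0.0]]

sitting, standing = split_density(crowd, sit_hint, stand_hint)
sit_count, stand_count = count(sitting), count(standing)
```

Because the weights at each pixel sum to 1, the sitting and standing counts in this toy setup always add up to the overall crowd count, which is the property a splitting step of this kind would need to preserve.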
To the best of our knowledge, we are the first to attempt the problem of categorized crowd counting.

978-1-7281-6926-2/20/$31.00 ©2020 IEEE

Existing approaches for general crowd counting can be largely divided into two groups: (i) the most recent density-based approaches (e.g., [1]–[6]) that generate a density map of the crowd to approximate a large crowd in outdoor environments, and (ii) the detection-based approaches that detect visible human body parts [7], [8] to count the number of persons in a given (mostly indoor) image. Though density-based counting is quite promising when counting people in a high-density crowd, it has the following limitations: (i) For images with a low-