VISTA: Fast and Efficient Traffic Surveillance by Tile Sampling

Shubham Chaudhary*, Aryan Taneja*, Anjali Singh†, Sohum Sikdar*, Mukulika Maity*, Arani Bhattacharya*
*Indraprastha Institute of Information Technology Delhi, †Indira Gandhi Delhi Technical University for Women, Kashmere
Email: *{shubhamch, aryan19027, sohum20339, mukulika, arani}@iiitd.ac.in, †anjali038btcse19@igdtuw.ac.in

Abstract—With the increasing number of vehicles in modern cities, traffic surveillance via cameras on roads has become an important application. Cities have installed thousands of cameras on roads, which send video feeds to a cloud center to run computer vision algorithms. This requires high bandwidth. Current techniques reduce the bandwidth requirement either by sending a limited number of frames/pixels/regions or by re-encoding the important parts of the video. The latter requires running DNNs to extract the important portions of a frame so that they can be sent again at a higher resolution from the camera to the server. This imposes significant compute overhead on the camera side, as re-encoding is known to be expensive, and makes the system less real-time. In this work, we propose VISTA, a system that utilizes tile sampling, where a limited number of rectangular areas within the frames, known as tiles, are sent to the server. We then propose an adaptive tile sampling algorithm that estimates the presence of moving objects by comparing the statistics of the tiles' bitrate (in kbps) and then retains only the necessary tiles, thus eliminating the need to run a DNN on the camera side. We evaluate VISTA on different datasets comprising 56 videos in total and show that, on average, our technique reduces the total amount of data sent to the cloud by 17-40% while providing a detection accuracy of over 85%. Furthermore, VISTA runs in real time even on cheap edge devices like the Raspberry Pi and NVIDIA Jetson Nano.
Further, it requires minimal calibration compared to prior works.

I. INTRODUCTION

Recently, real-time traffic surveillance has become important for the automatic enforcement of traffic rules [1], the control of traffic lights [2], and the detection of anomalous events like accidents [3]. Cities like Shanghai, London, and New Delhi have installed hundreds of thousands of surveillance cameras¹. The video feeds generated by these cameras are either processed locally or sent to data centers for applying computer vision algorithms. These algorithms depend on running deep neural networks (DNNs), which are inherently compute-intensive. Local processing using such algorithms requires expensive hardware (like GPUs and NPUs) installed alongside the cameras, which increases the cost of traffic surveillance substantially and makes it less scalable. On the other hand, a major challenge faced by techniques that send video feeds to a cloud server is that the amount of data generated is very high, going up to 1 Mbps per camera [4], leading to high bandwidth consumption. A data center catering to a city would therefore need to ingest data at terabit-per-second scale, which is very difficult to achieve in practice. Thus, it is essential to find techniques that reduce bandwidth consumption without sacrificing the quality of traffic surveillance.

Current techniques for reducing bandwidth typically rely on one or both of two approaches. The first is to use a heuristic to intelligently select the frames of a video that should be sent to the cloud server [5], [6]. While this can save a lot of bandwidth during off-peak hours, it is difficult to save bandwidth when the traffic is congested.

¹https://www.financialexpress.com/auto/industry/how-traffic-cameras-work-and-issue-challans-violation-tracking-fining-explained-delhi-mumbai-fines/2200053/
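As a rough sanity check on these ingestion numbers, the aggregate rate follows directly from the per-camera rate. The sketch below uses the 1 Mbps per-camera figure from the text; the camera count is an illustrative assumption for "hundreds of thousands" of cameras, not a figure from this paper:

```python
# Back-of-the-envelope aggregate bandwidth for city-scale surveillance.
PER_CAMERA_MBPS = 1.0     # per-camera feed rate cited in the text [4]
NUM_CAMERAS = 250_000     # assumed: "hundreds of thousands" of cameras

aggregate_gbps = PER_CAMERA_MBPS * NUM_CAMERAS / 1000
print(f"Aggregate ingestion: {aggregate_gbps:.0f} Gbps")  # 250 Gbps
```

Even at this conservative camera count, a single data center must sustain hundreds of gigabits per second of continuous ingestion, which motivates reducing the per-camera rate at the source.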
Moreover, this technique requires either integrating the algorithm into the camera's firmware or re-encoding on a device directly connected to the camera. Adding such capability to the cameras, or to the devices attached to cameras used in practice, would require a substantial investment. The second approach is to perform additional computation on the cloud server, either by running more powerful models [7] or by sending a signal to the camera to transmit additional data only when needed [4]. This saves bandwidth at the cost of additional GPU usage, which is also expensive and energy-intensive. Thus, a solution that runs without adding computation, while also being simple to integrate with existing systems, is essential to reduce the cost of traffic surveillance.

One possible way of solving the problem of high bandwidth consumption is to send only the objects or frames of interest to the server, as in DDS [4] and Reducto [6]. However, video is usually encoded so that pixels are defined as offsets from their neighboring pixel values as a compression strategy, where the neighbors can be either spatial or temporal. Thus, sending only the objects or frames of interest would require re-encoding, which is compute-intensive. We avoid the problem of re-encoding in the following way. Recent video standards like HEVC (High Efficiency Video Coding), also known as H.265, allow videos to be split into rectangular spatial blocks called tiles (see Figure 1 (b)). Tiled encoding enforces the constraint that pixels are not referenced across tile boundaries during encoding, thus making each tile an independent spatially encoded unit. This allows us to send only the tiles containing objects of interest while omitting the others. The key advantage of removing tiles is that it runs in real time even on embedded platforms like the Raspberry Pi, unlike filtering frames or objects of interest and re-encoding videos.
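The bitrate-based tile selection described above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: we assume the encoder exposes per-tile compressed sizes, and we flag a tile as containing a moving object when its size exceeds a running per-tile baseline by a fixed factor (the window length and threshold are assumed parameters):

```python
from collections import defaultdict, deque

WINDOW = 30        # assumed: frames of per-tile history
THRESHOLD = 1.5    # assumed: keep tile if size > THRESHOLD * baseline mean

# Per-tile sliding window of recent compressed sizes (in bytes).
history = defaultdict(lambda: deque(maxlen=WINDOW))

def select_tiles(tile_sizes):
    """Return indices of tiles whose bitrate suggests a moving object.

    tile_sizes maps tile index -> compressed size of that tile in the
    current frame. A static background compresses to a stable, small
    size; a burst above the baseline hints at new content (a vehicle).
    """
    keep = []
    for tile, size in tile_sizes.items():
        past = history[tile]
        baseline = sum(past) / len(past) if past else 0.0
        if not past or size > THRESHOLD * baseline:
            keep.append(tile)
        past.append(size)
    return keep

# Usage: after a quiet warm-up, tile 2 jumps well above its baseline
# and is the only tile retained for transmission.
for _ in range(WINDOW):
    select_tiles({0: 100, 1: 100, 2: 100})
print(select_tiles({0: 105, 1: 95, 2: 400}))  # -> [2]
```

Because the decision uses only byte counts already produced by the encoder, it avoids running any DNN on the camera and adds negligible compute per frame.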
As surveillance cameras with native HEVC support become increasingly available [8], removing tiles is therefore easy to integrate into deployed surveillance systems.