Thermal Imaging on Smart Vehicles for Person and Road Detection: Can a Lazy Approach Work?

Galadrielle Humblot-Renaux*, Vivian Li*, Daniela Pinto*, and Letizia Marchegiani

Abstract— This paper proposes the addition of a thermal camera to an RGB system with the goal of improving the reliability of person and road detection in unfavorable weather and illumination conditions. Custom data is gathered on an experimental vehicle and used for development and testing. For person detection, we propose a novel multi-modal approach in which bounding boxes are initially obtained from RGB and thermal images using YOLOv3-tiny. We then identify high-intensity connected components in thermal images to compensate for missed detections. Detections from the two cameras and the two algorithms are finally weighted and combined into a confidence map. Using the proposed fusion method, recall and precision are improved compared to using RGB only, without the need to retrain the network. For thermal-based road segmentation, we achieve an average precision of 94.2% after re-training MultiNet's KittiSeg decoder on a small thermal dataset, while using pre-trained weights for MultiNet's VGG-based encoder. These results show that the addition of thermal cameras to the perception systems of autonomous vehicles can bring substantial benefits with minimal labelling, implementation effort and training requirements.

I. INTRODUCTION

Autonomous vehicles rely on exteroceptive sensors to find a navigable path while avoiding obstacles. Traditionally, cameras are the sensor of choice for detecting obstacles in the scene. However, they often fall short in non-ideal illumination and weather conditions, as they are inherently sensitive to any visual change in the scene, such as darkness, fog, rain or glare from the sun [1]. Other sensor modalities have been used for similar purposes, such as LIDAR [2] and microphones [3], [4]; yet, while LIDAR also suffers in harsh weather conditions (e.g.
heavy rain, fog), acoustic sensing cannot, on its own, provide a full understanding of the environment. Radar is currently considered a valid solution, as it is quite resilient to a wide range of weather conditions and able to detect objects at long range [5]. However, despite recent progress in this direction (e.g. [6]), the interpretation of radar data remains challenging, due in part to the presence of noise and unwanted artifacts. This imposes significant limitations both when trying to leverage existing computer vision tools to parse the data, and when creating a labelled dataset for object detection tasks in radar scans.

In this work, we evaluate the benefits and potential of adding a thermal camera to an autonomous vehicle for urban environment understanding. The vehicle we employ in this study is an experimental golf-cart which operates on a university campus, driving primarily on unmarked roads and bicycle lanes with heavy pedestrian traffic. Given this application context, we focus our investigation on two crucial tasks: person and road detection. However, our findings could be extended to other detection tasks (e.g. vehicle detection).

Much like traditional cameras, thermal cameras provide the visual cues necessary not only to detect obstacles, but also to distinguish among different types of objects. They also share many of the useful properties of radar: they are not sensitive to visible light, they do not rely on any illumination source, and they do not "see" on-coming headlights, smoke, haze, etc. For this reason, they can be used to detect heat sources, such as people, through rain, snow or fog, even though these conditions may lead to a decrease in range or contrast [7].

Authors are members of the Department of Electronic Systems, Aalborg University, Denmark, {ghumbl19, vli16, dpinto16}@student.aau.dk; lm@es.aau.dk
* Authors contributed equally to this work.
Compared to radars, thermal cameras provide a far more intuitive representation of the environment, simplifying the labelling process. Furthermore, given the nature of the data, computer vision methods and techniques normally adopted in the RGB domain could be adapted and employed with minimal effort.

In this study, we propose a novel method for multi-modal person detection, where the predictions obtained on RGB and thermal images are weighted and combined into a single confidence map. Firstly, we generate bounding box estimates by employing a YOLOv3-tiny (You Only Look Once) architecture [8] on both kinds of data (i.e. RGB and thermal images). The network is used with pre-trained weights, without the need for additional retraining or for generating a labelled training set of thermal images. Secondly, high-intensity connected components in thermal images are identified and employed to compensate for missed detections. Lastly, predictions are scored and integrated into a confidence map. Additionally, we present a thermal image-based road detection framework, implemented through a MultiNet architecture [9], using pre-trained VGG16 weights for the encoder [10] and only re-training the KittiSeg segmentation decoder, such that the network is trained with very little thermal data. Note that we use the terms "road segmentation" and "road detection" interchangeably throughout this paper.

Our evaluation, based on real data collected with our experimental vehicle, demonstrates that thermal cameras could be a compelling alternative for vision-based systems operating on autonomous vehicles, both as a standalone modality and in combination with RGB cameras. By taking a "lazy" approach which leverages existing deep learning networks pre-trained on RGB data, we also show that enabling thermal vision on smart vehicles does not necessarily require developing dedicated architectures or annotating large datasets.
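The two thermal-specific stages outlined above — extracting high-intensity connected components as a fallback detector, and fusing per-source detections into a confidence map — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the intensity threshold, minimum blob area and fusion weights are placeholder values, and boxes are given as (x0, y0, x1, y1) pixel coordinates.

```python
import numpy as np
from collections import deque

THERMAL_THRESH = 200                    # assumed 8-bit "warm pixel" cutoff
W_RGB, W_THERMAL, W_BLOB = 0.5, 0.35, 0.15  # illustrative fusion weights

def thermal_blobs(thermal, thresh=THERMAL_THRESH, min_area=4):
    """Return bounding boxes (x0, y0, x1, y1) of high-intensity
    4-connected components in an 8-bit thermal frame (BFS flood fill)."""
    mask = thermal >= thresh
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy, sx] or seen[sy, sx]:
                continue
            # Grow one connected component from this seed pixel.
            q = deque([(sy, sx)])
            seen[sy, sx] = True
            x0 = x1 = sx
            y0 = y1 = sy
            area = 0
            while q:
                y, x = q.popleft()
                area += 1
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        q.append((ny, nx))
            if area >= min_area:        # discard tiny speckles
                boxes.append((x0, y0, x1, y1))
    return boxes

def confidence_map(shape, detections):
    """Accumulate weighted per-source box scores into a per-pixel map.
    detections: list of ((x0, y0, x1, y1), score, weight) triples."""
    cmap = np.zeros(shape, dtype=float)
    for (x0, y0, x1, y1), score, weight in detections:
        cmap[y0:y1 + 1, x0:x1 + 1] += weight * score
    return np.clip(cmap, 0.0, 1.0)

# Toy example: one warm blob, also "detected" by the RGB network.
thermal = np.zeros((8, 8), dtype=np.uint8)
thermal[2:5, 3:6] = 230
blobs = thermal_blobs(thermal)          # → [(3, 2, 5, 4)]
dets = [(b, 1.0, W_BLOB) for b in blobs]
dets.append(((3, 2, 5, 4), 0.8, W_RGB))  # hypothetical RGB YOLO detection
cmap = confidence_map(thermal.shape, dets)
```

Pixels covered by agreeing detections accumulate evidence from each source, so a person seen by both the RGB detector and the thermal fallback ends up with a higher confidence than one seen by either alone, which is the intuition behind the fusion step.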