Convolutional Neural Networks for Croatian Traffic Signs Recognition

Vedran Vukotić, Josip Krapac and Siniša Šegvić
University of Zagreb - Faculty of Electrical Engineering and Computing
Zagreb, HR-10000, Croatia
Email: vevukotic@gmail.com

Abstract—We present an approach to the recognition of Croatian traffic signs based on convolutional neural networks (CNNs). A library for quick prototyping of CNNs, with an educational scope, is first developed¹. An architecture similar to LeNet-5 is then created and tested on the MNIST dataset of handwritten digits, where comparable results were obtained. We analyze the FER-MASTIF TS2010 dataset and propose a CNN architecture for traffic sign recognition. The presented experiments confirm the feasibility of CNNs for the defined task and suggest improvements to be made in order to improve the recognition of Croatian traffic signs.

I. INTRODUCTION

Traffic sign recognition is an example of a multi-class recognition problem. Classical approaches to this problem in computer vision typically use the following well-known pipeline: (1) local feature extraction (e.g. SIFT), (2) feature coding and aggregation (e.g. BOW) and (3) learning a classifier to recognize the visual categories using the chosen representation (e.g. SVM). The downsides of these approaches include the suboptimality of the chosen features and the need to hand-design them.

CNNs approach this problem by learning meaningful representations directly from the data, so the learned representations are optimal for the specific classification problem, thus eliminating the need for hand-designed image features. A CNN architecture called LeNet-5 [1] was successfully trained for handwritten digit recognition and tested on the MNIST dataset [2], yielding state-of-the-art results at the time. An improved and larger CNN was later developed [3], obtaining the current state-of-the-art results on the GTSRB dataset [4].
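As an illustration of the classical three-stage pipeline above (not the method evaluated in this paper), the following sketch uses randomly generated descriptors as a stand-in for SIFT, a k-means codebook with histogram aggregation as the BOW step, and a linear SVM as the classifier; all sizes and names are assumptions chosen for the example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# (1) local feature extraction: stand-in for SIFT, each image yields
#     a variable-size set of 128-D descriptors
def extract_features(n_desc):
    return rng.normal(size=(n_desc, 128))

images = [extract_features(rng.integers(20, 50)) for _ in range(40)]
labels = rng.integers(0, 3, size=40)  # 3 visual categories

# (2) feature coding and aggregation: BOW codebook via k-means,
#     each image becomes a normalized histogram of visual words
codebook = KMeans(n_clusters=16, n_init=10, random_state=0)
codebook.fit(np.vstack(images))

def bow(descriptors):
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=16).astype(float)
    return hist / hist.sum()

X = np.array([bow(d) for d in images])

# (3) learn a classifier on the fixed-length representation
clf = SVC(kernel="linear").fit(X, labels)
print(X.shape)  # every image is now a fixed-length 16-D histogram
```

Note how the representation (the codebook) is fixed before the classifier is trained, which is exactly the suboptimality that CNNs avoid by learning features and classifier jointly.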
Following the results of [3], we were motivated to evaluate a similar architecture on the Croatian traffic signs dataset FER-MASTIF TS2010 [5]. To do so, we first developed a library that allows us to test different architectures easily. After different subsets were tested for successful convergence, an architecture similar to LeNet-5 was built and tested on the MNIST dataset, yielding satisfactory results. Following the successful reproduction of a handwritten digit classifier (an error rate between 1.7% and 0.8%, where LeNet-X architectures yield their results), we started testing architectures for a subset of classes of the FER-MASTIF TS2010 dataset.

In the first part of this article, CNNs are introduced and their specifics, compared to classical neural networks, are presented. Techniques and tricks for training them are briefly explained. In the second part, the datasets are described and the choice of a subset of classes of the FER-MASTIF TS2010 dataset is elaborated. In the last part of the paper, the experimental setup is explained and the results are discussed. Finally, common problems are shown and suggestions for future improvements are given.

¹ Available at https://github.com/v-v/CNN/

II. ARCHITECTURAL SPECIFICS OF CNNS

Convolutional neural networks represent a specialization of generic neural networks, where the individual neurons form a mathematical approximation of the biological visual receptive field [6]. Visual receptive fields correspond to small regions of the input that are processed by the same unit. The receptive fields of neighboring neurons overlap, thus providing robustness of the learned representation to small translations of the input. Each receptive field learns to react to a specific feature (automatically learned as a kernel). By combining many layers, the network forms a classifier that is able to automatically learn relevant features and is less prone to translational variance in the data.
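The two properties described above, local receptive fields and shared kernel weights, can be made concrete with a minimal NumPy sketch of a single "valid" 2-D convolution; this is an illustration of the mechanism, not the implementation from the accompanying library:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image: every output unit sees
    only a small local region, and all units share the same weights."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]      # local receptive field
            out[i, j] = np.sum(patch * kernel)   # shared weights
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0                   # a 3x3 averaging kernel
fmap = conv2d_valid(image, kernel)
print(fmap.shape)  # (4, 4)
```

Receptive fields of neighboring output units overlap by two columns here, which is what makes the resulting feature map robust to small translations of the input; in a trained CNN the kernel values are learned rather than fixed.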
In this section, the specific layers of CNNs (convolutional and pooling layers) are explained. A CNN is built by stacking many convolutional and pooling layers, so that the number of feature maps grows in each successive layer while their spatial size decreases. The output of the last CNN layer is a vector image representation. This image representation is then classified using a classical fully-connected MLP [3] or another classifier, e.g. an RBF network [1].

Fig. 1: Illustration of the typical architecture and the different layers used in CNNs. Many convolutional and pooling layers are stacked. The final layers consist of a fully connected network.

A. Feature maps

Fig. 2 shows a typical neuron (a) and a feature map (b). Neurons typically output a scalar, while feature maps represent

Proceedings of the Croatian Computer Vision Workshop, Year 2, September 16, 2014, Zagreb, Croatia
CCVW 2014, Computer Vision in Traffic and Transportation