Convolutional Neural Networks for Croatian Traffic Signs Recognition

Vedran Vukotić, Josip Krapac and Siniša Šegvić
University of Zagreb - Faculty of Electrical Engineering and Computing
Zagreb, HR-10000, Croatia
Email: vevukotic@gmail.com

Abstract—We present an approach to the recognition of Croatian traffic signs based on convolutional neural networks (CNNs). A library for quick prototyping of CNNs, with an educational scope, is first developed¹. An architecture similar to LeNet-5 is then created and tested on the MNIST dataset of handwritten digits, where comparable results were obtained. We analyze the FER-MASTIF TS2010 dataset and propose a CNN architecture for traffic sign recognition. The presented experiments confirm the feasibility of CNNs for the defined task and suggest improvements to be made in order to improve the recognition of Croatian traffic signs.

I. INTRODUCTION

Traffic sign recognition is an example of a multi-class recognition problem. Classical approaches to this problem in computer vision typically use the following well-known pipeline: (1) local feature extraction (e.g. SIFT), (2) feature coding and aggregation (e.g. BOW) and (3) learning a classifier to recognize the visual categories using the chosen representation (e.g. SVM). The downsides of these approaches include the suboptimality of the chosen features and the need to hand-design them.

CNNs approach this problem by learning meaningful representations directly from the data, so the learned representations are optimal for the specific classification problem, thus eliminating the need for hand-designed image features. A CNN architecture called LeNet-5 [1] was successfully trained for handwritten digit recognition and tested on the MNIST dataset [2], yielding state-of-the-art results at the time. An improved and larger CNN was later developed [3], obtaining the current state-of-the-art results on the GTSRB dataset [4].
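As an illustration of the classical three-stage pipeline above (not the method evaluated in this paper), the following sketch uses randomly generated descriptors as a stand-in for SIFT, a k-means codebook with histogram aggregation as the BOW step, and a linear SVM as the classifier; all sizes and names are assumptions chosen for the example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# (1) local feature extraction: stand-in for SIFT, each image yields
#     a variable-size set of 128-D descriptors
def extract_features(n_desc):
    return rng.normal(size=(n_desc, 128))

images = [extract_features(rng.integers(20, 50)) for _ in range(40)]
labels = rng.integers(0, 3, size=40)  # 3 visual categories

# (2) feature coding and aggregation: BOW codebook via k-means,
#     each image becomes a normalized histogram of visual words
codebook = KMeans(n_clusters=16, n_init=10, random_state=0)
codebook.fit(np.vstack(images))

def bow(descriptors):
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=16).astype(float)
    return hist / hist.sum()

X = np.array([bow(d) for d in images])

# (3) learn a classifier on the fixed-length representation
clf = SVC(kernel="linear").fit(X, labels)
print(X.shape)  # every image is now a fixed-length 16-D histogram
```

Note how the representation (the codebook) is fixed before the classifier is trained, which is exactly the suboptimality that CNNs avoid by learning features and classifier jointly.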
Following the results of [3], we were motivated to evaluate a similar architecture on the Croatian traffic signs dataset FER-MASTIF TS2010 [5]. To do so, we first developed a library that allows us to test different architectures easily. After different subsets were tested for successful convergence, an architecture similar to LeNet-5 was built and tested on the MNIST dataset, yielding satisfactory results. Following the successful reproduction of a handwritten digit classifier (an error rate between 1.7% and 0.8%, where LeNet-X architectures yield their results), we started testing architectures for a subset of classes of the FER-MASTIF TS2010 dataset.

In the first part of this article, CNNs are introduced and their specifics, compared to classical neural networks, are presented. Techniques and tricks for training them are briefly explained. In the second part, the datasets are described and the choice of a subset of classes of the FER-MASTIF TS2010 dataset is elaborated. In the last part of the paper, the experimental setup is explained and the results are discussed. Finally, common problems are shown and suggestions for future improvements are given.

¹ Available at https://github.com/v-v/CNN/

II. ARCHITECTURAL SPECIFICS OF CNNS

Convolutional neural networks represent a specialization of generic neural networks, where the individual neurons form a mathematical approximation of the biological visual receptive field [6]. Visual receptive fields correspond to small regions of the input that are processed by the same unit. The receptive fields of neighboring neurons overlap, thus providing robustness of the learned representation to small translations of the input. Each receptive field learns to react to a specific feature (automatically learned as a kernel). By combining many layers, the network forms a classifier that is able to automatically learn relevant features and is less prone to translational variance in the data.
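The two properties described above, local receptive fields and shared kernel weights, can be made concrete with a minimal NumPy sketch of a single "valid" 2-D convolution; this is an illustration of the mechanism, not the implementation from the accompanying library:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image: every output unit sees
    only a small local region, and all units share the same weights."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]      # local receptive field
            out[i, j] = np.sum(patch * kernel)   # shared weights
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0                   # a 3x3 averaging kernel
fmap = conv2d_valid(image, kernel)
print(fmap.shape)  # (4, 4)
```

Receptive fields of neighboring output units overlap by two columns here, which is what makes the resulting feature map robust to small translations of the input; in a trained CNN the kernel values are learned rather than fixed.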
In this section, the specific layers of CNNs (convolutional and pooling layers) are explained. A CNN is built by stacking many convolutional and pooling layers, so that the number of feature maps grows in each successive layer while their spatial size decreases. The output of the last CNN layer is a vector image representation. This image representation is then classified using a classical fully-connected MLP [3] or another classifier, e.g. an RBF network [1].

Fig. 1: Illustration of the typical architecture and the different layers used in CNNs. Many convolutional and pooling layers are stacked. The final layers consist of a fully connected network.

A. Feature maps

Fig. 2 shows a typical neuron (a) and a feature map (b). Neurons typically output a scalar, while feature maps represent

Proceedings of the Croatian Computer Vision Workshop, Year 2, September 16, 2014, Zagreb, Croatia
CCVW 2014, Computer Vision in Traffic and Transportation