From Simulation to Reality: CNN Transfer Learning for Scene Classification

Jordan J. Bird¹, Diego R. Faria², and Anikó Ekárt³
Aston Robotics, Vision and Intelligent Systems Lab
Aston University
Birmingham, United Kingdom
Email: {birdj1¹, d.faria², a.ekart³}@aston.ac.uk

Pedro P. S. Ayrosa
Universidade Estadual de Londrina
Londrina, Brazil
Email: ayrosa@uel.br

Abstract—In this work, we show that both fine-tune learning and cross-domain sim-to-real transfer learning from virtual to real-world environments improve the starting and final scene classification abilities of a computer vision model. A 6-class scene classification problem is presented using both videogame environments and photographs of the real world, where both datasets share the same classes. Twelve networks, with 2, 4, 8, ..., 4096 hidden interpretation neurons following a fine-tuned VGG16 Convolutional Neural Network, are trained on a dataset of virtual data gathered from the Unity game engine and on a photographic dataset gathered from an online image search engine. Twelve transfer learning networks are then benchmarked, using the networks trained on virtual data as the starting weight distribution for a neural network classifying the real-world dataset. Results show that all of the transfer networks have a higher starting accuracy before training, with the best showing an improvement of +48.34% in image classification ability and an average increase of +38.33% in starting ability across all hyperparameter sets benchmarked. Of the 12 experiments, nine transfer experiments showed an improvement over non-transfer learning, two showed slightly lower ability, and one did not change. The best overall accuracy, 89.16%, was obtained by a transfer learning model with a layer of 64 interpretation neurons, compared to 88.27% for its non-transfer counterpart. An average increase of +7.15% was observed over all experiments.
The main finding is that not only can a higher final classification accuracy be achieved, but strong classification abilities prior to any training whatsoever are also encountered when transferring knowledge from simulation to real-world data, proving useful domain knowledge transfer between the datasets.

Keywords—Sim-to-real, Transfer Learning, Deep Learning, Computer Vision, Autonomous Perception, Scene Classification, Environment Recognition

I. INTRODUCTION

Transfer learning from simulated data to real-world application is promising because the scarcity of labelled real-world data is an issue encountered in many applications of machine learning and artificial intelligence [1], [2], [3]. Fine-tune learning and transfer learning are therefore both considered viable solutions to the issue of data scarcity in the scientific state of the art, via large-scale models such as ImageNet and VGG16 for the former, and methods such as rule and weight transfer for the latter [4], [5], [6]. Here, we perform both of these methods in a pipeline for scene classification, by fine-tuning a large-scale model and transferring knowledge learnt from simulation to real-world datasets.

The consumer-level quality of videogame technology has rapidly improved towards arguably photo-realistic graphical quality through ray-traced lighting, high-resolution photographic textures, and Physically Based Rendering (PBR), to name but a few prominent techniques. Since simulated environments are ever more realistic, this raises the question: is it possible to transfer knowledge from them to real-world situations? Should this be possible, the problem of data scarcity would be mitigated, and a more optimal learning process would also become possible by introducing a starting point learned from simulation.
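The weight-transfer idea described above can be illustrated with a minimal sketch. This is our own illustration, not the authors' implementation: the helper `init_dense` and the layer sizes (512 CNN features, 64 interpretation neurons, 6 classes) are assumptions chosen to mirror the pipeline's description, where a network for real-world data starts from weights already trained on simulation data rather than from a random distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

def init_dense(n_in, n_out, source=None):
    """Return (weights, biases) for a dense layer.

    If `source` is given, copy its trained parameters (transfer learning);
    otherwise fall back to a classical random weight distribution.
    """
    if source is not None:
        W, b = source
        return W.copy(), b.copy()
    return rng.standard_normal((n_in, n_out)) * 0.01, np.zeros(n_out)

# Hypothetical classification head trained on simulation data:
# CNN features -> 64 interpretation neurons -> 6 scene classes.
sim_hidden = init_dense(512, 64)
sim_output = init_dense(64, 6)

# Non-transfer network: classical random starting weights.
real_hidden = init_dense(512, 64)
real_output = init_dense(64, 6)

# Transfer network: simulation-trained weights as the starting
# distribution, to be further trained on the real-world dataset.
real_hidden_tl = init_dense(512, 64, source=sim_hidden)
real_output_tl = init_dense(64, 6, source=sim_output)
```

In the paper's pipeline, such transferred starting weights would then be updated by backpropagation on the real-world images; the reported finding is that the transferred start already classifies markedly better than a random one before any further training.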
If this process provides a better starting point than, for example, a classical random weight distribution, then fewer computational resources and fewer labelled data points are required to learn about the real world. Moreover, if the process is improved further, learning from real-world data may not be required at all.

In this work, we perform 12 individual topology experiments to show that real-world classification of relatively scarce data can be improved by pre-training models on simulation data from a high-quality videogame environment. The weights developed on simulation data are applied as a starting point for the backpropagation learning of real-world data, and we find that both starting accuracies and asymptotes (final abilities) are often higher when the model has been able to train on simulation data before considering real data.

The main scientific contributions of this work are threefold:
1) The formation of two datasets for a 6-class scene classification problem, comprising both artificial simulation and real-world photographic data¹.
2) 24 topology tuning experiments for best classification of the two datasets, 12 for each of the datasets with 2, 4, 8, ..., 4096 interpretation neurons following the fine

¹ https://www.kaggle.com/birdy654/environment-recognition-simulation-to-reality