Accuracy vs. traffic trade-off of Learning IoT Data Patterns at the Edge with Hypothesis Transfer Learning Lorenzo Valerio Institute for Informatics and Telematics National Research Council Pisa, Italy Email: lorenzo.valerio@iit.cnr.it Andrea Passarella Institute for Informatics and Telematics National Research Council Pisa, Italy Email: andrea.passarella@iit.cnr.it Marco Conti Institute for Informatics and Telematics National Research Council Pisa, Italy Email: marco.conti@iit.cnr.it Abstract—Currently, the dominant paradigm to support knowledge extraction from raw IoT data is through global cloud platforms, where data is collected from IoT devices and analysed. However, with the rapidly growing number of IoT devices deployed in the physical environment, this approach might simply not scale. The data gravity concept, one of the foundations of Fog and Mobile Edge Computing, points towards a decentralisation of computation for data analysis, whereby the latter is performed closer to where data is generated. Following this trend, in this paper we explore the accuracy vs. network traffic trade-off when using Hypothesis Transfer Learning (HTL) to learn patterns from data generated in a set of distributed physical locations. HTL is a standard machine learning technique used to train models on separate, disjoint training sets and then transfer the partial models (instead of the data) to obtain a single learning model. We have previously applied HTL to the problem of learning human activities when data are available in different physical locations (e.g., areas of a city). In our approach, data is not moved from where it is generated, while partial models are exchanged across sites. The HTL-based approach achieves lower (though acceptable) accuracy with respect to a conventional solution based on global cloud computing, but drastically cuts the network traffic.
In this paper we explore the trade-off between accuracy and traffic, by assuming that data are moved to a variable number of data collectors where partial learning is performed. Centralised cloud and completely decentralised HTL are the two extremes of this spectrum. Our results show that there is no significant advantage, in terms of accuracy, in using fewer collectors, and that therefore a distributed HTL solution, along the lines of a fog computing approach, is the most promising one.

I. INTRODUCTION

There is unanimous consensus in the research and industry communities that IoT applications will account for a quite significant (and daily increasing) share of the Big Data that we will have to manage in the near future [1]. Many reference market analyses (e.g., [2]) show that IoT is one of the technologies (not only in the ICT domain) bound to have the biggest economic potential. As most of the value of IoT applications will come from the analysis of the data generated by IoT devices, the research area of IoT data analysis and management is a very challenging and exciting one. The common trend in most current architectures [3] is to transfer IoT data from the physical locations where they are generated to some global cloud platform, where knowledge is extracted from the raw data and used to support IoT applications. This is the case, among others, of the ETSI M2M architecture [4]. However, there are concerns about whether this approach will be sustainable in the long run. The projected growth of the number of deployed IoT devices is exponential over the next years [2]. Together with data generated by personal users' devices, which are also likely to be part of IoT applications, this is likely to make the amount of data generated at the edge of the network huge, making it impractical or simply impossible to transfer it to a remote cloud platform.
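To make the model-exchange idea described in the abstract concrete, the following is a minimal, hypothetical sketch (not the exact method of [7]): each data collector trains a partial model, here a simple nearest-centroid classifier, on its own disjoint local subset, and only the model parameters are transferred and averaged into a single global model. The data generator, site count, and centroid averaging are illustrative assumptions chosen for brevity.

```python
import random

random.seed(0)

def make_data(n, offset):
    # Illustrative 2-D data: two classes per site, with a small per-site shift.
    data = []
    for _ in range(n):
        data.append(([random.gauss(0 + offset, 1), random.gauss(0, 1)], 0))
        data.append(([random.gauss(3 + offset, 1), random.gauss(3, 1)], 1))
    return data

def train_local(data):
    # Partial model: per-class centroids learned from the local subset only.
    sums = {0: [0.0, 0.0], 1: [0.0, 0.0]}
    counts = {0: 0, 1: 0}
    for x, y in data:
        sums[y][0] += x[0]
        sums[y][1] += x[1]
        counts[y] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def combine(models):
    # HTL-style aggregation: only the partial models travel, never the raw data.
    k = len(models)
    return {c: [sum(m[c][i] for m in models) / k for i in range(2)]
            for c in (0, 1)}

def predict(model, x):
    # Assign the class whose centroid is closest (squared Euclidean distance).
    def d2(c):
        return sum((x[i] - model[c][i]) ** 2 for i in range(2))
    return 0 if d2(0) < d2(1) else 1

# Three data collectors, each holding a disjoint local training set.
sites = [make_data(200, off) for off in (0.0, 0.2, -0.2)]
global_model = combine([train_local(s) for s in sites])

test = make_data(500, 0.0)
acc = sum(predict(global_model, x) == y for x, y in test) / len(test)
print(f"accuracy: {acc:.2f}")

# Traffic comparison: each site sends 4 floats (2 centroids x 2 dims)
# instead of its entire local dataset.
model_floats = 3 * 4
data_floats = sum(len(s) * 2 for s in sites)
print(f"model traffic: {model_floats} floats, data traffic: {data_floats} floats")
```

Even in this toy setting, the parameters exchanged are orders of magnitude smaller than the raw data, which is the essence of the traffic saving discussed in this paper; the real solution in [7] uses richer models, but the same transfer principle.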
In addition, data might also have privacy and confidentiality constraints, which might make it impossible to transfer them to third parties such as global cloud platform operators. These trends push towards a decentralisation of cloud platforms towards the edge of the network, according to the Fog [5] and Mobile Edge Computing [6] paradigms. In this paper we follow this approach, and study the behaviour of a distributed learning solution based on Hypothesis Transfer Learning (HTL). In general, in HTL, instead of training a model on the whole training set, multiple parallel models are trained on disjoint subsets, and the partial models are then combined to obtain a single final model. We applied HTL to the case of distributed learning in IoT environments in [7], where we presented an activity recognition solution. Specifically, in our use case, data collected from users' personal devices are available in a number of disjoint physical locations, where partial learning models are trained. HTL is used to exchange and combine these partial models and obtain a unique model. In [7] we showed that this solution is able to drastically cut the network traffic required to perform the learning task, with an affordable reduction of accuracy with respect to a conventional solution where data is transferred to a global cloud platform. In this paper, we study the accuracy vs. network traffic trade-off of this solution when a variable number of Data Collectors (DCs) is used. Specifically, we assume that disjoint