International Journal of Machine Learning and Cybernetics
https://doi.org/10.1007/s13042-018-00904-3

ORIGINAL ARTICLE

Big data aggregation in the case of heterogeneity: a feasibility study for digital health

Alex Adim Obinikpo 1 · Burak Kantarci 1

Received: 5 March 2018 / Accepted: 14 December 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract
In big data applications, an important factor that may affect the value of the acquired data is missing data, which arises when data is lost either during acquisition or during storage. The former can result from faulty acquisition devices or non-responsive sensors, whereas the latter can result from hardware failures at the storage units. In this paper, we consider human activity recognition as a case study of a typical machine learning application on big datasets. We conduct a comprehensive feasibility study on the fusion of sensory data acquired from heterogeneous sources. We present insights on the aggregation of heterogeneous datasets with minimal missing data values for future use. Our experiments on the accuracy, F1 score, and positive predictive value (PPV) of various key machine learning algorithms show that sensory data acquired by wearables are less vulnerable to missing data and smaller training sets, whereas smart portable devices require larger training sets to reduce the impact of possibly missing data.

Keywords Dedicated sensors · Non-dedicated sensors · Aggregation

1 Introduction

With the advent of the Big Data phenomenon in smart environments, the ultimate goal of integrating big data analytics methodologies with smart services is to ensure quality of service for end users. Service quality can be ensured by developing and integrating effective and efficient data acquisition techniques, and by applying improved methodologies to the acquired data [1, 2].
Among the smart environments that have been thriving in the Big Data era, digital health (D-Health) is becoming more robust and practical as data acquisition devices such as smartphones and wearables proliferate in the big data space [3, 4]. While the design and implementation of D-Health systems benefit from big data analytics [5–7], the services offered through these systems are varied; digital patient assistants and automated feedback systems are just two examples [8].

Figure 1 illustrates a broad overview of big data and its applications in D-Health as a layered model. The first layer contains data sources, such as wearables or smart devices, and acquisition methods for collecting sensory data in various formats. Besides being large in volume, the acquired data can be in various formats and even unstructured. The second layer contains processing modules for the acquired data. These include selection of appropriate features, trimming of the dataset, aggregation of multi-sensory data and transformation of the aggregated data into a ready-to-analyze format. The third layer, namely the data analytics layer, takes the processed data as input and runs machine learning algorithms on frameworks (e.g. Hadoop) or knowledge analysis software (e.g. WEKA). The results of the analytics layer are conveyed to the application layer, where report views and visualizations are communicated to the end users through the web tier (e.g. a mobile app).

As is widely known, big data has become a phenomenon because of the 5 Vs, namely volume, velocity, variety, veracity and value. Among the 5 Vs, volume is an important aspect as it relates to the efficiency and scalability of analytics solutions. On the other hand, the value of data is determined by both quantitative factors (i.e. volume) and qualitative factors (i.e. usefulness of content).
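To make the aggregation step of the second (processing) layer more concrete, the following is a minimal sketch, not the method used in this study. It assumes that each reading is a hypothetical `(source, timestamp_s, value)` tuple, that a lost sample is recorded as `None`, and that sources with different sampling frequencies are aligned by averaging within fixed time windows. The function names `aggregate` and `missing_ratio`, the window length, and the sample values are illustrative assumptions only.

```python
from collections import defaultdict
from statistics import mean


def aggregate(readings, window_s=1.0):
    """Align readings from heterogeneous sources on fixed time windows.

    readings: iterable of (source, timestamp_s, value); value is None
    when the sample was lost (assumed convention for this sketch).
    Returns {window_index: {source: mean of valid values, or None}}.
    """
    samples = defaultdict(list)  # (window, source) -> valid values
    sources, windows = set(), set()
    for source, timestamp_s, value in readings:
        window = int(timestamp_s // window_s)
        sources.add(source)
        windows.add(window)
        if value is not None:
            samples[(window, source)].append(value)
    if not windows:
        return {}
    table = {}
    for window in range(min(windows), max(windows) + 1):
        table[window] = {
            s: mean(samples[(window, s)]) if (window, s) in samples else None
            for s in sorted(sources)
        }
    return table


def missing_ratio(table):
    """Fraction of (window, source) cells that hold no valid value."""
    cells = [v for row in table.values() for v in row.values()]
    return sum(v is None for v in cells) / len(cells) if cells else 0.0


# Illustrative data: a wearable sampled at 2 Hz and a phone at 1 Hz,
# where the phone loses its second sample.
readings = [
    ("wearable", 0.0, 1.0), ("wearable", 0.5, 3.0),
    ("wearable", 1.0, 2.0), ("wearable", 1.5, 4.0),
    ("phone", 0.2, 10.0), ("phone", 1.2, None),
]

if __name__ == "__main__":
    table = aggregate(readings)
    for window, row in table.items():
        print(window, row)
    print("missing ratio:", missing_ratio(table))
```

Averaging within windows is only one possible alignment choice; it keeps the missing cells explicit, so that a later analytics step can decide whether to impute or discard them.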
* Burak Kantarci, burak.kantarci@uottawa.ca
1 University of Ottawa, Ottawa, Canada

Various forms of data are being collected from many devices with various operating systems, sampling frequencies and battery capacities. Under such heterogeneity, the