International Journal of Machine Learning and Cybernetics
https://doi.org/10.1007/s13042-018-00904-3
ORIGINAL ARTICLE
Big data aggregation in the case of heterogeneity: a feasibility study
for digital health
Alex Adim Obinikpo¹ · Burak Kantarci¹
Received: 5 March 2018 / Accepted: 14 December 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2019
Abstract
In big data applications, an important factor that may affect the value of the acquired data is missing data, which arises
when data are lost during either acquisition or storage. The former can result from faulty acquisition devices or
non-responsive sensors, whereas the latter can result from hardware failures at the storage units. In this paper, we
consider human activity recognition as a case study of a typical machine learning application on big datasets. We conduct
a comprehensive feasibility study on the fusion of sensory data acquired from heterogeneous sources. We present
insights on the aggregation of heterogeneous datasets with minimal missing data values for future use. Our experiments on
the accuracy, F-1 score, and positive predictive value (PPV) of various key machine learning algorithms show that sensory
data acquired by wearables are less vulnerable to missing data and smaller training sets, whereas smart portable devices
require larger training sets to reduce the impact of possibly missing data.
Keywords Dedicated sensors · Non-dedicated sensors · Aggregation
1 Introduction
With the advent of Big Data in smart environments, the
ultimate goal of integrating big data analytics methodologies
with smart services is to ensure the quality of service for
the end users. Service quality can be ensured by developing
and integrating effective and efficient data acquisition
techniques, as well as by applying improved methodologies to
the acquired data [1, 2]. Among the smart environments
that have been thriving in the Big Data era, digital health
(D-Health) is becoming more robust and practical due to the
proliferation of data acquisition devices such as smartphones
and wearables in the big data space [3, 4]. While the
design and implementation of D-Health systems benefit from
big data analytics [5–7], the services offered through
these systems are varied; digital patient assistants and
automated feedback systems are just a few examples [8].
Figure 1 illustrates a broad overview of big data and its
applications in D-Health as a layered model. The first layer
contains data sources, such as wearables or smart devices,
and acquisition methods for collecting sensory data in
various formats. Besides being large in volume, the acquired
data can be in various formats and may even be unstructured.
The second layer contains processing modules for the acquired
data. These include selection of appropriate features,
trimming of the dataset, aggregation of multi-sensory data,
and transformation of the aggregated data into a
ready-to-analyze format. The third layer, namely the data
analytics layer, takes the processed data as input and calls
machine learning algorithms on frameworks (e.g. Hadoop) or
knowledge analysis software (e.g. WEKA). The results of the
analytics layer are conveyed to the application layer, where
report views and visualizations are communicated to the end
users through the web tier (e.g. a mobile app).
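The flow of data through the four layers can be illustrated with a minimal sketch. It assumes two hypothetical sources (a wearable and a phone) sampled at matching timestamps, with `None` marking a missing reading. All function names, field names and sample values are illustrative assumptions. The nearest-centroid classifier is only a stand-in for the machine learning algorithms that the analytics layer would call; it is not the method evaluated in this paper.

```python
from statistics import mean

# Layer 1: acquisition. Hypothetical records from two heterogeneous sources.
wearable = [
    {"t": 0, "acc": 0.1, "label": "sit"},
    {"t": 1, "acc": None, "label": "sit"},   # missing reading
    {"t": 2, "acc": 1.9, "label": "walk"},
]
phone = [
    {"t": 0, "gyro": 0.2},
    {"t": 1, "gyro": 0.3},
    {"t": 2, "gyro": 2.1},
    {"t": 3, "gyro": 2.0},                   # no matching wearable sample
]


def process(wearable_rows, phone_rows):
    """Layer 2: trim incomplete rows, aggregate by timestamp, build feature rows."""
    phone_by_t = {row["t"]: row for row in phone_rows}
    rows = []
    for w in wearable_rows:
        p = phone_by_t.get(w["t"])
        if p is None or w["acc"] is None or p["gyro"] is None:
            continue  # drop rows with any missing value
        rows.append(((w["acc"], p["gyro"]), w["label"]))
    return rows


def train(rows):
    """Layer 3: analytics. Compute one centroid per activity label."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*feats))
        for label, feats in by_label.items()
    }


def predict(centroids, features):
    """Layer 3: return the label whose centroid is closest to the features."""
    def squared_distance(label):
        return sum((c - f) ** 2 for c, f in zip(centroids[label], features))
    return min(centroids, key=squared_distance)


def report(activity):
    """Layer 4: application. Format the result for the end user."""
    return f"Predicted activity: {activity}"


rows = process(wearable, phone)
centroids = train(rows)
print(report(predict(centroids, (1.8, 2.0))))
```

Dropping incomplete rows is the simplest possible trimming policy and is used here only to keep the sketch short; imputation would be an alternative design choice.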
As is widely known, big data has become a phenomenon
because of the 5 Vs, namely volume, velocity, variety,
veracity and value. Among the 5 Vs, volume is an important
aspect as it is related to the efficiency and scalability of
analytics solutions. On the other hand, the value of data
depends on both quantitative factors (i.e. volume) and
qualitative factors (i.e. usefulness of content).
Various forms of data are being collected from many
devices with various operating systems, sampling frequencies
and battery capacities. Under such heterogeneity, the
* Burak Kantarci, burak.kantarci@uottawa.ca
¹ University of Ottawa, Ottawa, Canada