Automatic Control and Information Sciences, 2017, Vol. 3, No. 1, 8-15
Available online at http://pubs.sciepub.com/acis/3/1/3
©Science and Education Publishing
DOI:10.12691/acis-3-1-3
Heterogeneous Data and Big Data Analytics
Lidong Wang*
Department of Engineering Technology, Mississippi Valley State University, Itta Bena, MS, USA
*Corresponding author: lwang22@students.tntech.edu
Abstract Heterogeneity is one of the major features of big data, and heterogeneous data cause problems in data
integration and Big Data analytics. This paper introduces data processing methods for heterogeneous data and Big
Data analytics, Big Data tools, and some traditional data mining (DM) and machine learning (ML) methods. Deep
learning and its potential in Big Data analytics are analysed. The benefits of the confluences among Big Data
analytics, deep learning, high performance computing (HPC), and heterogeneous computing are presented.
Challenges of dealing with heterogeneous data and Big Data analytics are also discussed.
Keywords: Big Data, Big Data analytics, heterogeneous data, deep learning, data mining, machine learning,
heterogeneous computing, computational intelligence, artificial intelligence
Cite This Article: Lidong Wang, “Heterogeneous Data and Big Data Analytics.” Automatic Control and
Information Sciences, vol. 3, no. 1 (2017): 8-15. doi: 10.12691/acis-3-1-3.
1. Introduction
Heterogeneous data are any data with high variability in
data types and formats. They may be ambiguous and of
low quality because of missing values, high data redundancy,
and untruthfulness. It is difficult to integrate heterogeneous
data to meet business information demands. For
example, heterogeneous data are often generated by the
Internet of Things (IoT). Data generated by the IoT often
have the following four features [1]. First, they are
heterogeneous. Because of the variety of data acquisition
devices, the acquired data differ in type.
Second, they are large in scale. Data acquisition
equipment is deployed in massive numbers and widely distributed, and not
only the currently acquired data but also the historical
data within a certain time frame must be stored. Third,
they are strongly correlated with time and space.
Every data acquisition device is placed at a specific
geographic location and every piece of data has a time
stamp. The time and space correlation is an important
property of data from the IoT. Fourth, effective data account
for only a small portion of the big data. A great deal of
noise may be collected during the acquisition and
transmission of data in the IoT, so among the datasets acquired by
acquisition devices, only a small amount of data is
valuable. There are the following types of data heterogeneity
[2]:
• Syntactic heterogeneity occurs when two data
sources are not expressed in the same language.
• Conceptual heterogeneity, also known as semantic
heterogeneity or logical mismatch, denotes the
differences in modelling the same domain of
interest.
• Terminological heterogeneity stands for variations
in names when referring to the same entities from
different data sources.
• Semiotic heterogeneity, also known as pragmatic
heterogeneity, stands for different interpretations of
entities by different people.
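
As a small illustration of the first and third types in the list above, the following Python sketch integrates two hypothetical sources that differ both in language (CSV versus JSON, i.e., syntactic heterogeneity) and in the names used for the same entities (terminological heterogeneity). The source contents, the field names, and the hand-written FIELD_MAP are assumptions made for this example only and do not come from [2].

```python
import csv
import io
import json

# Hypothetical source A: CSV text with its own column names.
SOURCE_A = "sensor_id,temp_c\ns1,21.5\ns2,19.0\n"
# Hypothetical source B: JSON text naming the same entities differently.
SOURCE_B = '[{"device": "s3", "temperature": 22.1}]'

# Hand-written mapping from source-specific names to shared names.
# It resolves terminological heterogeneity (an assumption of this sketch).
FIELD_MAP = {
    "sensor_id": "device_id",
    "device": "device_id",
    "temp_c": "temperature_c",
    "temperature": "temperature_c",
}


def parse_csv(text):
    """Parse CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json(text):
    """Parse JSON text into a list of dict records."""
    return json.loads(text)


def normalize(record):
    """Rename fields to the shared vocabulary and coerce the value type."""
    out = {FIELD_MAP[name]: value for name, value in record.items()}
    out["temperature_c"] = float(out["temperature_c"])
    return out


def integrate():
    """Resolve the syntactic difference by parsing, then normalize names."""
    records = parse_csv(SOURCE_A) + parse_json(SOURCE_B)
    return [normalize(r) for r in records]


unified = integrate()
```

Conceptual and semiotic heterogeneity are not addressed by such a name mapping; they generally require modelling decisions and domain knowledge.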
Data representation can be described at four levels [3].
Level 1 is diverse raw data of different types and from
different sources. Level 2 is called ‘unified representation’.
Heterogeneous data need to be unified, and too much
data can lead to high cognitive and data processing costs.
This level converts individual attributes into information
in terms of ‘what-when-where’. Level 3 is aggregation.
Spatial data can be naturally represented in the form of
spatial grids with thematic attributes. Processing operators
include segmentation and aggregation. Aggregation aids
visualization and supports intuitive querying. Level 4
is called ‘situation detection and representation’. The situation
at a location is characterized based on spatiotemporal
descriptors determined by using appropriate operators
at level 3. The final step in situation detection is a
classification operation that uses domain knowledge to
assign an appropriate class to each cell.
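
The four levels can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from [3]: the record layouts, the grid cell size, the use of the mean as the aggregation operator, and the threshold rule standing in for domain knowledge are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Level 1: raw records from hypothetical, differently shaped sources.
raw = [
    {"src": "sensor", "t": 10, "lat": 0.2, "lon": 0.7, "pm25": 80.0},
    {"src": "sensor", "t": 12, "lat": 0.4, "lon": 0.1, "pm25": 60.0},
    {"src": "report", "time": 11, "pos": (1.5, 0.3), "value": 10.0},
]


def unify(rec):
    """Level 2: map a raw record to a (what, when, where) tuple."""
    if rec["src"] == "sensor":
        return (rec["pm25"], rec["t"], (rec["lat"], rec["lon"]))
    return (rec["value"], rec["time"], rec["pos"])


def aggregate(records, cell_size=1.0):
    """Level 3: average the 'what' values that fall in each grid cell."""
    cells = defaultdict(list)
    for what, _when, (lat, lon) in records:
        cell = (int(lat // cell_size), int(lon // cell_size))
        cells[cell].append(what)
    return {cell: mean(values) for cell, values in cells.items()}


def classify(grid, threshold=50.0):
    """Level 4: assign a situation class to each cell (made-up rule)."""
    return {
        cell: "alert" if value > threshold else "normal"
        for cell, value in grid.items()
    }


unified = [unify(r) for r in raw]
grid = aggregate(unified)
situations = classify(grid)
```

Here the two sensor readings fall into the same cell and are averaged, while the report falls into a neighbouring cell; the classification step then labels each cell independently.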
Metadata are crucial for future querying. For relational
tables and some Extensible Markup Language (XML)
documents, explicit schema definitions in Structured
Query Language (SQL), XML Schema Definition (XSD),
or Document Type Definition (DTD) can be directly
obtained from sources and integrated into a metamodel.
The XML technique is used for data translation. The
tricky part is semi-structured data (such as XML without
XSD, JSON, or partially structured Excel or CSV
files), which contain implicit schemas. Therefore, the
Structural Metadata Discovery (SMD) component is
responsible for discovering implicit metadata
(e.g., entity types, relationship types, and constraints)
from semi-structured data [4,5]. Metadata management
issues are important. For an appropriate interpretation of
heterogeneous big data, detailed metadata are required.
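
To make the idea of discovering an implicit schema concrete, the following Python sketch infers field names, observed value types, and whether each field is present in every record from a handful of JSON documents. It is a simplified stand-in and not the SMD component described in [4,5]; the function name infer_schema and the sample documents are hypothetical, and relationship types and constraints are not covered.

```python
import json


def infer_schema(records):
    """Infer field names, observed types, and optionality from dict records."""
    seen = {}
    for rec in records:
        for key, value in rec.items():
            entry = seen.setdefault(key, {"types": set(), "count": 0})
            entry["types"].add(type(value).__name__)
            entry["count"] += 1
    total = len(records)
    # A field is treated as required only if every record contains it.
    return {
        key: {"types": sorted(e["types"]), "required": e["count"] == total}
        for key, e in seen.items()
    }


# Hypothetical semi-structured input without an explicit schema.
docs = json.loads(
    '[{"id": 1, "name": "a"},'
    ' {"id": 2, "tags": ["x"]},'
    ' {"id": "3", "name": null}]'
)
schema = infer_schema(docs)
```

In this sample, the field id appears in every record but with two different value types, which is the kind of inconsistency that discovered metadata can expose before integration.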
Some reports contain some metadata, but many more
details such as about the specific sensor used in data