Automatic Control and Information Sciences, 2017, Vol. 3, No. 1, 8-15
Available online at http://pubs.sciepub.com/acis/3/1/3
©Science and Education Publishing
DOI:10.12691/acis-3-1-3

Heterogeneous Data and Big Data Analytics

Lidong Wang*

Department of Engineering Technology, Mississippi Valley State University, Itta Bena, MS, USA
*Corresponding author: lwang22@students.tntech.edu

Abstract  Heterogeneity is one of the major features of big data, and heterogeneous data cause problems in data integration and Big Data analytics. This paper introduces data processing methods for heterogeneous data and Big Data analytics, Big Data tools, and some traditional data mining (DM) and machine learning (ML) methods. Deep learning and its potential in Big Data analytics are analysed. The benefits of the confluence of Big Data analytics, deep learning, high performance computing (HPC), and heterogeneous computing are presented. Challenges in dealing with heterogeneous data and Big Data analytics are also discussed.

Keywords: Big Data, Big Data analytics, heterogeneous data, deep learning, data mining, machine learning, heterogeneous computing, computational intelligence, artificial intelligence

Cite This Article: Lidong Wang, "Heterogeneous Data and Big Data Analytics." Automatic Control and Information Sciences, vol. 3, no. 1 (2017): 8-15. doi: 10.12691/acis-3-1-3.

1. Introduction

Heterogeneous data are any data with high variability of data types and formats. They may be ambiguous and of low quality due to missing values, high data redundancy, and untruthfulness. Integrating heterogeneous data to meet business information demands is difficult. Heterogeneous data are often generated, for example, by the Internet of Things (IoT). Data generated from IoT often have the following four features [1]. First, the data are heterogeneous: because of the variety of data acquisition devices, the acquired data differ in type and format.
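To make this kind of format heterogeneity concrete, the sketch below (a hypothetical illustration, not from the paper; the record shapes and field names are assumptions) normalizes IoT readings that arrive in two different formats, a JSON object and a bare CSV line, into one common representation:

```python
import csv
import io
import json

def normalize(record):
    """Map a heterogeneous raw record to one common dict shape.

    Two hypothetical source formats are assumed:
    - JSON objects: {"sensor": "t1", "temp_c": 21.5, "ts": 1700000000}
    - CSV lines:    sensor,ts,temp_c
    """
    if record.lstrip().startswith("{"):
        obj = json.loads(record)
        return {"sensor": obj["sensor"], "ts": int(obj["ts"]),
                "temp_c": float(obj["temp_c"])}
    # Otherwise treat the record as a single CSV line.
    sensor, ts, temp_c = next(csv.reader(io.StringIO(record)))
    return {"sensor": sensor, "ts": int(ts), "temp_c": float(temp_c)}

raw = ['{"sensor": "t1", "temp_c": 21.5, "ts": 1700000000}',
       "t2,1700000060,19.0"]
unified = [normalize(r) for r in raw]
```

In a real IoT pipeline the dispatch logic would be driven by source metadata rather than by sniffing the payload, but the principle is the same: heterogeneity is resolved by mapping every source format onto one agreed schema before analysis.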
Second, they are large-scale: massive, distributed data acquisition equipment is used, and not only the currently acquired data but also the historical data within a certain time frame must be stored. Third, there is a strong correlation between time and space: every data acquisition device is placed at a specific geographic location, and every piece of data has a time stamp. This time-space correlation is an important property of IoT data. Fourth, effective data account for only a small portion of the big data: a great quantity of noise may be collected during the acquisition and transmission of data in IoT, and among the datasets acquired, only a small amount of data is valuable.

There are four types of data heterogeneity [2]. Syntactic heterogeneity occurs when two data sources are not expressed in the same language. Conceptual heterogeneity, also known as semantic heterogeneity or logical mismatch, denotes differences in modelling the same domain of interest. Terminological heterogeneity stands for variations in the names used to refer to the same entities in different data sources. Semiotic heterogeneity, also known as pragmatic heterogeneity, stands for different interpretations of entities by people.

Data representation can be described at four levels [3]. Level 1 is diverse raw data with different types and from different sources. Level 2 is called 'unified representation': heterogeneous data need to be unified, and because too much data can lead to high cognitive and data processing costs, this level converts individual attributes into information in terms of 'what-when-where'. Level 3 is aggregation: spatial data can be naturally represented in the form of spatial grids with thematic attributes, and typical processing operators include segmentation and aggregation. Aggregation aids visualization and supports intuitive queries. Level 4 is called 'situation detection and representation'.
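The step from Level 2 to Level 3 can be sketched as follows (a minimal illustration under assumed data; the field names, grid size, and readings are hypothetical, not from the paper): unified 'what-when-where' records are snapped to spatial grid cells and aggregated per cell.

```python
from collections import defaultdict

# Hypothetical Level 2 records: each reading already unified
# into 'what-when-where' form plus a measured value.
readings = [
    {"what": "temp", "when": 10, "where": (1.2, 3.7), "value": 21.0},
    {"what": "temp", "when": 11, "where": (1.4, 3.9), "value": 23.0},
    {"what": "temp", "when": 12, "where": (5.1, 0.2), "value": 18.0},
]

def grid_cell(point, size=1.0):
    """Snap a coordinate pair to its containing grid cell."""
    x, y = point
    return (int(x // size), int(y // size))

# Level 3 aggregation: collect values per spatial cell,
# then reduce each cell to a mean (a thematic attribute).
cells = defaultdict(list)
for r in readings:
    cells[grid_cell(r["where"])].append(r["value"])

means = {cell: sum(v) / len(v) for cell, v in cells.items()}
```

The resulting per-cell means are exactly the kind of grid-with-thematic-attributes representation that supports easy visualization and intuitive queries, and that a Level 4 classifier could label with a situation class per cell.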
The situation at a location is characterized by spatiotemporal descriptors determined with appropriate operators at Level 3. The final step in situation detection is a classification operation that uses domain knowledge to assign an appropriate class to each cell.

Metadata are crucial for future querying. For relational tables and some Extensible Markup Language (XML) documents, explicit schema definitions in Structured Query Language (SQL), XML Schema Definition (XSD), or Document Type Definition (DTD) can be obtained directly from the sources and integrated into a metamodel. The XML technique is used for data translation. The tricky part is semi-structured data (such as XML without an XSD, JSON, or partially structured Excel or CSV files), which contains only implicit schemas. Therefore, the component Structural Metadata Discovery (SMD) takes over the responsibility of discovering implicit metadata (e.g., entity types, relationship types, and constraints) from semi-structured data [4,5]. Metadata management issues are important: for an appropriate interpretation of heterogeneous big data, detailed metadata are required. Some reports contain some metadata, but many more details, such as those about the specific sensor used in data