Semantic Extract-Transform-Load framework for Big Data Integration

Srividya K Bansal
Arizona State University, Mesa, AZ, USA
srividya.bansal@asu.edu

Sebastian Kagemann
Indiana University – Bloomington, IN, USA
sakagema@umail.iu.edu

Abstract: Big Data researchers are dealing with the Variety of data, which includes various formats such as structured, numeric, unstructured text data, email, video, and audio. The proposed Semantic Extract-Transform-Load (ETL) framework uses semantic technologies to integrate and publish data from multiple sources as open linked data. It provides an extensible solution for effective data integration, facilitating the creation of smart urban apps for smarter living. A case study is presented that uses the proposed framework to integrate datasets from various Massive Open Online Courses and Household travel data along with Fuel Economy data.

Keywords: data integration; linked data; ontology engineering; semantic technologies

Big Data comprises billions to trillions of records of millions of people, all from different sources (e.g., Web, customer contact center, social media, mobile data, sales). The data is typically loosely structured and is often incomplete and inaccessible. Big Data is transforming science, engineering, medicine, healthcare, finance, business, and ultimately society itself. Massive amounts of data are available to be harvested for competitive business advantage, government policies, and new insights into a broad array of applications (including healthcare, biomedicine, energy, smart cities, genomics, and transportation). Yet most of this data is inaccessible to users, as we need technology and tools to find, transform, analyze, and visualize data in order to make it consumable for decision-making [1]. The research community also agrees that it is important to engineer Big Data meaningfully [2].
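The Semantic ETL idea summarized in the abstract (extract records from several sources, transform them onto a shared vocabulary, and load them as linked data) can be illustrated with a minimal sketch. This is not the authors' implementation. The `http://example.org/` namespace, the function names `to_triples` and `to_ntriples`, and the two course-provider record layouts are all hypothetical and are introduced only for illustration. The sketch uses plain Python tuples rather than an RDF library.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace; not taken from the paper.
BASE = "http://example.org/"


def to_triples(source: str, key: str,
               records: Iterable[Dict[str, str]],
               mapping: Dict[str, str]) -> Set[Triple]:
    """Transform flat records into (subject, predicate, object) triples.

    `mapping` renames source-specific field names to shared vocabulary
    terms, so that records from different sources use the same predicates.
    """
    triples: Set[Triple] = set()
    for record in records:
        subject = f"{BASE}{source}/{record[key]}"
        for field, value in record.items():
            if field == key:
                continue  # the key is already encoded in the subject URI
            predicate = f"{BASE}vocab/{mapping.get(field, field)}"
            triples.add((subject, predicate, value))
    return triples


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialize triples in an N-Triples-like form with plain literals."""
    lines = []
    for s, p, o in sorted(triples):
        escaped = (o.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


# Extract: two hypothetical sources describing courses with different schemas.
provider_a = [{"id": "c1", "title": "Databases", "teacher": "Lee"}]
provider_b = [{"course_id": "x9", "name": "Algorithms", "instructor": "Kim"}]

# Transform: map each source's field names onto the shared terms.
triples_a = to_triples("providerA", "id", provider_a, {"teacher": "instructor"})
triples_b = to_triples("providerB", "course_id", provider_b, {"name": "title"})

# Load: the integrated dataset is simply the union of both triple sets.
integrated = triples_a | triples_b
print(to_ntriples(integrated))
```

After the mapping step, both sources share the same two predicates, so a single query over `integrated` can find every course title or instructor regardless of which source the record came from. A real pipeline would derive the mapping from an ontology rather than a hand-written dictionary.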
Meaningful data integration in a schema-less and complex Big Data world of databases is a big open challenge. Big Data research is usually discussed in terms of the 3V’s: Volume (storage of massive amounts of data streaming in from social media, sensors, and machine-to-machine data being collected), Velocity (reacting quickly enough to deal with data in near-real time), and Variety (data is in various formats such as structured, numeric, unstructured text data, email, video, audio, stock ticker, etc.). Big Data challenges lie not only in storing and managing this variety of data but also in extracting and analyzing consistent information from it. Researchers are working on creating a common conceptual model for the integrated data [3].

The method of publishing and linking structured data on the web is called Linked Data. This data is machine-readable, its meaning is explicitly defined, it is linked to other external data sets, and it can be linked to from other data sets as well. The Linked Open Data (LOD) community effort has led to a huge data space, with 31 billion Resource Description Framework (RDF) [4] triples, and a W3C specification for data interchange on the web [5]. LOD can be used in a number of interesting Web and mobile applications. The Linking Open Government Data (LOGD) project [6] investigates translating government-related data using Semantic Web technologies. LOD has gained significant adoption and momentum, though the quality of the interconnecting relationships remains questionable [7].

The IBM Smarter City initiative aims at creating cities that are vital and safe for their citizens and businesses. Its focus is on building the infrastructure for fundamental services, such as roadways, mass transit, and utilities, that make a city desirable and livable. The IEEE Smart Cities Initiative brings together technology, government, and society to enable smart economy, mobility, environment, living, and governance.
Both of these initiatives have to integrate and use information from various data sources in addition to setting up the required infrastructure. Government agencies are also increasingly making their data accessible through initiatives such as data.gov to promote transparency and economic growth [8].

We need ways to organize a variety of data such that concepts with similar meaning are related through links, while distinct concepts are also clearly represented with semantic metadata. This will allow effective and creative use of query engines and analytic tools for Big Data, which is essential to creating smart and sustainable environments. Figure 1 shows the future vision of a web portal with Linked Open Urban data integrated and published from various sources and domains.

The need to integrate Big Data has been heightened in recent years by growing demand for, and interest in, mobile applications that improve quality of life in cities. Here is an example where various data sources can be used: a traffic jam that emerges due to an unplanned protest may be captured through a Twitter stream, but missed when examining weather conditions, event databases, reported roadwork, etc. Additionally, weather sensors in the city tend to miss localized events such as flooding. Combined, however, these views can provide a richer and more complete picture of the state of the city by merging traditional data sources with messy and unreliable social media streams, thereby contributing to smart living, environment, economy, mobility, and governance. Such applications rely on Big Data available to the public via the cloud.

As outlined in the latest McKinsey Global Institute report, the global economy is beginning to operate in real time [9]. According to this report, the total value generated by new data technologies will be measured in trillions of dollars globally. The National