Semi-Structured Data Model for Big Data (SS-DMBD)

Shady Hamouda 1,2 and Zurinahni Zainol 1
1 Universiti Sains Malaysia, Penang, Malaysia
2 Emirates College of Technology, Abu Dhabi, U.A.E.

Keywords: Semi-Structured Data, Document-oriented Database, Big Data, NoSQL.

Abstract: New business applications require a flexible data model structure that can support the next generation of web applications and handle complex data types. Processing structured data through a relational database no longer performs adequately in the face of big data challenges. There is now a need to handle semi-structured data with a flexible schema across different applications. Not only SQL (NoSQL) databases have been introduced to overcome the limitations of relational databases in terms of scale, performance, data model, and distribution. NoSQL also supports semi-structured data, can handle huge amounts of data, and provides flexibility in the data schema. However, the data models of NoSQL systems are very complex, as no tools are available to represent a schema for NoSQL databases. In addition, there is no standard schema for data modelling of document-oriented databases. This study proposes a semi-structured data model for big data (SS-DMBD) that is compatible with a document-oriented database, and also proposes an algorithm for mapping the entity relationship (ER) model to SS-DMBD. A case study is used to evaluate the SS-DMBD and its features. The results show that this model can address most features of semi-structured data.

1 INTRODUCTION

Nowadays, business applications require databases that can support extreme scale and deal with many data formats. Information technology in the healthcare sector is shifting from structured data to semi-structured data (Wang et al., 2018). Assunção et al. (2015) and Stanescu et al.
(2016) discussed the challenges of big data, such as how to deal with increasing data volume and the need for a semi-structured data type to store and handle large amounts of data with a flexible schema. Assunção et al. (2015) and Siddiqa et al. (2017) discussed the problems that relational databases face in big data handling, namely how to process and integrate data of high variety and velocity. Therefore, the Not only SQL (NoSQL) database has been presented as a new technology for designing a data model without strict constraints (Wang et al., 2018). NoSQL can accept structured, semi-structured, and unstructured data and offers features such as support for distributed systems, a flexible schema, horizontal scalability, and easy replication (Storey and Song, 2017; Quattrone et al., 2016). Moreover, NoSQL data models are classified according to how data is stored and retrieved, as each model has its own way of designing, storing, and processing data (Storey and Song, 2017).

A document-oriented database is designed to manage and store data as documents grouped into collections. A document's contents are encapsulated or encoded in a standard format such as Extensible Markup Language (XML), JavaScript Object Notation (JSON), or Binary JSON (BSON) for storing and retrieving the data (Li et al., 2014). Each document has a unique primary key. A document can also include different data types, such as complex data structures, nested objects, arrays, and embedded documents (Zhao et al., 2013).

Semi-structured data, in turn, is emerging as one of the best models for handling large amounts of data. Hashem and Ranc (2016) noted that NoSQL distribution supports a schema that gives flexibility in handling and processing semi-structured data. Semi-structured data can be stored in XML, JSON, or BSON.
Moreover, a document-oriented database stores data in a semi-structured format using the key-value concept. The value of a key can be of any data type, which gives the database the flexibility to store any kind of data (Zhao et al., 2013).

Hamouda, S. and Zainol, Z. Semi-Structured Data Model for Big Data (SS-DMBD). DOI: 10.5220/0007957603480356 In Proceedings of the 8th International Conference on Data Science, Technology and Applications (DATA 2019), pages 348-356 ISBN: 978-989-758-377-3 Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
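
As a minimal sketch of the document structure described in the Introduction, the following Python snippet models a collection as a plain list of dictionaries. The collection name `patients`, all field names and values, and the helpers `find_by_id` and `field_names` are hypothetical and chosen only for illustration. They are not part of SS-DMBD or of any particular database's API.

```python
import json

# Hypothetical collection: a plain list standing in for a document store.
# All field names and values below are illustrative assumptions.
patients = [
    {
        "_id": "p-001",                                        # unique key of the document
        "name": "A. Example",
        "address": {"city": "Penang", "country": "Malaysia"},  # nested object
        "allergies": ["penicillin", "latex"],                  # array
        "visits": [                                            # embedded documents
            {"date": "2019-01-15", "reason": "check-up"},
            {"date": "2019-03-02", "reason": "follow-up"},
        ],
    },
    {
        "_id": "p-002",
        "name": "B. Example",
        "phone": "000-0000",  # field absent from p-001: no fixed schema is enforced
    },
]


def find_by_id(collection, doc_id):
    """Return the document whose "_id" equals doc_id, or None if there is none."""
    for doc in collection:
        if doc["_id"] == doc_id:
            return doc
    return None


def field_names(collection):
    """Map each document's "_id" to the sorted list of its top-level keys."""
    return {doc["_id"]: sorted(doc) for doc in collection}


# A document is encoded as JSON text for storage and decoded on retrieval.
encoded = json.dumps(patients[0])
decoded = json.loads(encoded)
```

Here the two documents share a collection but carry different fields, which is the schema flexibility the text attributes to document-oriented databases. The JSON round trip stands in for the encoding step, and a real system might use XML or a binary encoding instead.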