ORIGINAL PAPER

MapReduce based integration of health hubs: a healthcare design approach

Ramesh Dharavath 1,2 & Samuel Nyakotey 1 & Damodar Reddy Edla 2

Received: 23 January 2019 / Accepted: 29 March 2019
© IUPESM and Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract
The growing population in Asia creates a need to integrate healthcare so that different diseases can be treated efficiently and in a timely manner. The healthcare domain is one of the most important and challenging fields in terms of data collection and analysis, and it offers many opportunities to uncover hidden knowledge in health records. The growth of unstructured data in large volumes calls for NoSQL data management tools to handle it. The proposed framework uses a MapReduce Approach (MRA) for data management in the healthcare industry, with a join-based expectation maximization algorithm as the NoSQL data management solution, which scales the data with accurate modality. The approach also simplifies the integration of healthcare data from different models and different health hubs in a distributed environment. Experimental results show that the proposed approach scales when integrating and matching the unstructured data of different health data sources. Examples are illustrated with suitable methodology, and further research scope is pinpointed.

Keywords NoSQL database · MapReduce · Expectation maximization · HDFS · Health data

1 Introduction

Business experts in the healthcare industry look for an effective management system that handles the required information according to market demand and provides quality of service (QoS). For this, many experts rely on a relational database management system (RDBMS), which retrieves information effectively.
However, the growth of large amounts of unstructured data in the healthcare industry creates a need for new cost-effective and efficient management systems. Such data can be managed efficiently by a distributed ecosystem such as Hadoop, which provides distributed functionality together with storage and offers a cost-effective solution for managing data in different forms from different sources. Data of this kind is termed Big Data. For smart health hubs spread across different geographical locations, a Big Data problem arises when the hubs contribute their hospital management information for joint analysis. Integrating the same entity across health hubs can cause confusion when the hubs do not define the same, consistent schema in their relational management systems [1, 2]. An effective joining and integration mechanism is also required to manage structured, semi-structured and unstructured data and to obtain results that predict and assist medical diagnosis. To provide a cost-efficient solution to these problems, this framework proposes a MapReduce algorithmic approach that joins and integrates various data models of the healthcare domain. For structured data, it obtains results similar to those produced by an RDBMS; where data are not present in a structured form, it uses generalized logic over a NoSQL database to give the desired result [3].
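To make the idea of joining one entity across hubs with different schemas concrete, the following is a minimal sketch of a generic reduce-side equi-join, simulated in a single Python process. It is not the join-based expectation maximization algorithm proposed in this paper. The hub records, the field names (`patient_id`, `pid`, `drug`) and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records from two health hubs whose schemas differ:
# hub A calls the key "patient_id", hub B calls it "pid".
HUB_A = [
    {"patient_id": "P1", "name": "Asha", "diagnosis": "diabetes"},
    {"patient_id": "P2", "name": "Ravi", "diagnosis": "asthma"},
]
HUB_B = [
    {"pid": "P1", "drug": "metformin"},
    {"pid": "P3", "drug": "ibuprofen"},
]


def map_hub_a(record):
    """Map phase for hub A: emit (join key, tagged payload)."""
    payload = {"name": record["name"], "diagnosis": record["diagnosis"]}
    yield record["patient_id"], ("A", payload)


def map_hub_b(record):
    """Map phase for hub B: normalise 'pid' to the shared join key."""
    yield record["pid"], ("B", {"drug": record["drug"]})


def shuffle(pairs):
    """Group all emitted values by key, as the framework's shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Reduce phase: pair every hub-A payload with every hub-B payload."""
    left = [payload for tag, payload in values if tag == "A"]
    right = [payload for tag, payload in values if tag == "B"]
    for a in left:
        for b in right:
            yield {"patient_id": key, **a, **b}


def run_join(hub_a, hub_b):
    """Run map, shuffle and reduce in sequence and return the joined rows."""
    pairs = []
    for record in hub_a:
        pairs.extend(map_hub_a(record))
    for record in hub_b:
        pairs.extend(map_hub_b(record))
    joined = []
    for key, values in sorted(shuffle(pairs).items()):
        joined.extend(reduce_join(key, values))
    return joined
```

Calling `run_join(HUB_A, HUB_B)` returns only the patients present in both hubs (an inner join). On a real Hadoop cluster the map and reduce functions would run on separate nodes and the shuffle would be carried out by the framework itself.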
* Ramesh Dharavath
drramesh@iitism.ac.in

Samuel Nyakotey
snyakotey@gmail.com

Damodar Reddy Edla
dr.reddy@nitgoa.ac.in

1 Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, Jharkhand 826004, India
2 Department of Computer Science and Engineering, National Institute of Technology, Farmagudi, Goa 403401, India

Health and Technology
https://doi.org/10.1007/s12553-019-00321-8

To integrate and match the common attributes of different entities spread across different