Similarity Analysis of Patients’ Data: Bangladesh Perspective

Shahidul Islam Khan, Abu Sayed Md. Latiful Hoque
Department of Computer Science and Engineering (CSE)
Bangladesh University of Engineering and Technology (BUET)
Dhaka, Bangladesh
nayeemkh@gmail.com; asmlatifulhoque@cse.buet.ac.bd

Abstract—Misspelling of names is a major problem in real-world datasets; as a consequence, a single person may be identified as several different people. In Bangladesh, many people do not actually know their full name, and many citizens are unable to pronounce their name correctly, even in their mother tongue. The same person may therefore give different versions of their name when receiving a public service, e.g., treatment in a hospital. In almost all healthcare centers, a patient is asked for demographic data, i.e., name, age, etc., and reports it orally. This creates ambiguity through misspelled names. In this paper, we provide an algorithm to identify the same person correctly despite variations in the name. Experimental results show that our proposed technique can link records with high accuracy for noisy data such as misspelled patient names.

Keywords— Record Linkage; Name; Bangladesh; Phonetic Analysis; Health Data

I. INTRODUCTION

Data and information have changed our lives and society. Nowadays, tremendous amounts of data are collected around the world from many different sources. Attitudes of people and instruments are also documented. A great deal of hidden knowledge is waiting to be discovered from these data. This is the challenge of the big data era.

Record linkage (also known as identity resolution, data matching, etc.) refers to the task of finding records that refer to the same entity across various data sources such as computer databases, data files, books, and websites.
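To make the record linkage task concrete, the following minimal sketch links records by approximate name comparison when no common identifier is available. It is a generic string-similarity illustration built on Python's standard `difflib` module, not the algorithm proposed in this paper; the sample records, the function names, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name, treat periods as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(records, threshold=0.85):
    """Return index pairs (i, j), i < j, whose names score at or above threshold.

    The threshold is an assumed value; a real system would tune it on labeled data.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = name_similarity(records[i]["name"], records[j]["name"])
            if score >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical patient records without a shared identifier.
records = [
    {"name": "Rahim Uddin", "age": 40},
    {"name": "Rohim Uddin", "age": 40},
    {"name": "Karim Mia", "age": 52},
]

candidate_links = link_records(records)
```

Here the first two records differ by a single vowel and are proposed as a candidate link, while the third is not. Comparing every pair is quadratic in the number of records, so a dataset of realistic size would need some form of candidate reduction before pairwise comparison.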
Record linkage, or data matching, is essential when joining datasets based on entities that may or may not share a common identifier such as a passport number, health card number, insurance number, national identity number, smart card number, or social security number [1]-[3].

The name of a person plays an important role in identifying the person and in genealogical investigation. Conversely, name disparity can be a major obstacle when identifying and searching for people in applications such as web search, security, and health research. Variations in names make identification difficult because it is not easy to determine whether a name variant is a different spelling of the same name or the name of a different person. Variations can be categorized primarily as character, spelling, and phonetic variations.

Nearly 160 million people live in Bangladesh, and 230 million people worldwide speak Bengali. The names of Bangladeshi people have characteristics different from European or American names, so separate algorithms should be developed to address the name matching problem for Bangladeshi citizens. In this paper, we propose an algorithm that can analyze the similarities among Bangladeshi names. Experimental results show that our algorithm can identify the similarity of patients’ names in the presence of typical practical noise, e.g., misspelled names. For a noisy health dataset of 633,609 patient records, we achieved 87% correct name matching.

II. A MOTIVATIONAL EXAMPLE

The Bangladesh government took an initiative in 2009 to develop the National Health Data Warehouse (NHDW) with the help of the German donor GIZ. The objective of the warehouse is to build an electronic data repository that bridges the gaps between the various available digital health record sets and makes them interoperable.
Currently, medical data from different healthcare organizations under the Directorate General of Health Services (DGHS) of the Bangladesh Government are being collected through two open-source software systems: DHIS2 and OpenMRS [4]-[7]. A block diagram of the overall system is depicted in Fig. 1.

Fig. 1. Block Diagram of National Health Data Cloud

This research is supported by the ICT Division, Ministry of Posts, Telecommunications and Information Technology, Government of the People's Republic of Bangladesh.

978-1-5090-5421-3/16/$31.00 ©2016 IEEE
International Conference on Medical Engineering, Health Informatics and Technology (MediTec 2016)
Author Version