Similarity Analysis of Patients’ Data: Bangladesh
Perspective
Shahidul Islam Khan, Abu Sayed Md. Latiful Hoque
Department of Computer Science and Engineering (CSE)
Bangladesh University of Engineering and Technology (BUET)
Dhaka, Bangladesh
nayeemkh@gmail.com; asmlatifulhoque@cse.buet.ac.bd
Abstract— Misspelling of names is a major problem of real
world datasets; as a consequence, a single person can be identified
as several different people. In Bangladesh, it is common for people
not to know their full name, and many Bangladeshi citizens are
unable to pronounce their name correctly, even in their mother
tongue. The same person may therefore give different versions of
their name when receiving a public service, e.g., treatment in a
hospital. In almost all healthcare centers, patients are asked for
their demographic data (name, age, etc.) and report them orally.
This creates ambiguity through misspelled names. In this paper, we
present an algorithm that correctly identifies the same person
across variations of a name. Experimental results show that the
proposed technique links records with high accuracy on noisy
data such as misspelled patient names.
Keywords— Record Linkage; Name; Bangladesh; Phonetic
Analysis; Health Data;
I. INTRODUCTION
Data and information have changed our lives and society.
Around the world, tremendous amounts of data are now
collected from many sources. The behavior of people and of
instruments is also documented. A great deal of hidden
knowledge is waiting to be discovered in these data. This is
the challenge of the big data era.
Record linkage (also known as identity resolution, data
matching, etc.) refers to the task of finding records that refer
to the same entity across various data sources such as
computer databases, data files, books, and websites. This
linkage or data matching is essential when joining datasets
based on entities that may or may not share a common
identifier such as a passport number, health card number,
insurance number, national identity number, smart card
number, or social security number [1]-[3].
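To make the task concrete, the following is a minimal sketch of linking two record sets that share no common identifier. It is an illustration only and not the algorithm proposed in this paper: the field names (`id`, `name`, `age`), the sample records, the 0.85 threshold, and the use of Python's standard `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    # Lowercase, treat periods as spaces, and collapse repeated whitespace.
    return " ".join(name.lower().replace(".", " ").split())


def name_similarity(a, b):
    # Character-level similarity in [0, 1] between two normalized names.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """Pair records whose ages agree and whose names are similar enough.

    The threshold is an illustrative assumption, not a tuned value.
    """
    links = []
    for l in left:
        for r in right:
            if l["age"] != r["age"]:
                continue  # cheap filter before the costlier name comparison
            score = name_similarity(l["name"], r["name"])
            if score >= threshold:
                links.append((l["id"], r["id"], round(score, 2)))
    return links


# Hypothetical records from two sources with no shared identifier.
hospital_a = [
    {"id": "A1", "name": "Md. Rahim Uddin", "age": 45},
    {"id": "A2", "name": "Fatema Begum", "age": 30},
]
hospital_b = [
    {"id": "B1", "name": "Md Rohim Uddin", "age": 45},
    {"id": "B2", "name": "Karim Hossain", "age": 45},
]

links = link_records(hospital_a, hospital_b)
print(links)
```

Under these assumptions only the pair A1/B1 is linked: the two spellings differ by one letter, while the other candidates either disagree on age or have clearly dissimilar names. Comparing every pair of records is quadratic in the number of records, so a dataset of realistic size would need some form of blocking or indexing first.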
A person's name plays an important role in identifying the
person and in genealogical investigation. Conversely, name
variation can be a major obstacle to identifying and
searching for people in applications such as web search,
security, and health research. Variations in names create great
difficulties in identifying people because it is not easy to
determine whether a variant is a different spelling of the same
name or the name of a different person. Variations can be
categorized primarily as character, spelling, and phonetic
variations.
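The difficulty of telling a respelling from a different name can be illustrated with a small sketch. It compares a character-level measure (the classic Levenshtein edit distance) with a deliberately simple phonetic key. The key function `consonant_key` is a hypothetical helper invented for this example; it is neither Soundex nor the technique proposed in this paper, and the sample names are assumptions.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def consonant_key(name):
    """Hypothetical phonetic key (illustration only).

    Keeps the first letter, drops later vowels, and collapses a letter
    that repeats the previously kept one.
    """
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for c in letters[1:]:
        if c in "aeiou":
            continue
        if c != key[-1]:
            key.append(c)
    return "".join(key)


pairs = [("Mohammad", "Muhammed"), ("Karim", "Kabir")]
for a, b in pairs:
    print(a, b, levenshtein(a, b), consonant_key(a), consonant_key(b))
```

Both pairs are two edits apart, so edit distance alone cannot separate them. The phonetic key, however, maps "Mohammad" and "Muhammed" to the same string while giving "Karim" and "Kabir" different ones, which is why phonetic analysis is a natural complement to character-level comparison.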
Nearly 160 million people live in Bangladesh, and 230
million people worldwide speak Bengali. Bangladeshi names
have characteristics that differ from those of European or
American names, so separate algorithms should be developed
to address the name matching problem for Bangladeshi
citizens.
In this paper, we propose an algorithm that analyzes the
similarities among Bangladeshi names. Experimental results
show that the algorithm successfully identifies the similarity
of patients’ names in the presence of typical practical noise,
e.g., misspelled names. For a noisy health dataset of 633,609
patient records, we achieved 87% correct name matching.
II. A MOTIVATIONAL EXAMPLE
The Bangladesh government took the initiative to develop the
National Health Data Warehouse (NHDW) in 2009 with the
help of the German donor GIZ. The objective of the warehouse
is to build an electronic data repository that bridges the gaps
between the various available digital health record sets and
makes them interoperable. Currently, medical data from
different healthcare organizations under the Directorate General
of Health Services (DGHS) of the Bangladesh Government are
being collected through two open source software packages,
DHIS2 and OpenMRS [4]-[7]. A block diagram of the overall
system is depicted in Fig. 1.
Fig. 1. Block Diagram of National Health Data Cloud
This research is supported by the ICT Division, Ministry of Posts,
Telecommunications and Information Technology, Government of the
People's Republic of Bangladesh.
978-1-5090-5421-3/16/$31.00©2016 IEEE
International Conference on Medical Engineering, Health Informatics and Technology (MediTec 2016)
Author Version