IJSRSET151563 | Received: 09 October 2015 | Accepted: 16 October 2015 | September-October 2015 [(1)5: 290-294]
© 2015 IJSRSET | Volume 1 | Issue 5 | Print ISSN : 2395-1990 | Online ISSN : 2394-4099
Themed Section: Engineering and Technology
Detecting Duplicate Records - A Case Study
T. Parimalam, R. Deepa, R. Nirmala Devi, P. Yamuna Devi
Department of Computer Science, Nandha Arts and Science College, Erode, Tamil Nadu, India
ABSTRACT
Databases play an important role in today's IT-based economy. Many industries and systems depend on the accuracy
of databases to carry out their operations. The quality of the information stored in a database can therefore have
significant cost implications for a system that relies on that information to function and conduct business. Often, in the
real world, entities have two or more representations in databases. Duplicate detection is the process of identifying
multiple representations of the same real-world entities. The purpose of this paper is to provide a thorough study of
different methods used for detecting duplicate records. The paper also discusses different duplicate
detection tools in detail.
Keywords: Database, Duplicate Detection, Records
I. INTRODUCTION
Data quality has become a key issue in computer-based
management systems. Inadequate data causes serious
operational difficulties as well as direct financial losses.
Operational databases store information generated by
business transactions, and this information is used by
management to support business decisions. Data
accuracy assurance is vital, as data is the cornerstone of
a company’s business operations. In addition to serious
implications on decision making, the quality of the data
may affect customer satisfaction, resulting in
unnecessary and possibly high costs to repair damage
caused by low-quality data. In an ideal situation, each
data item should have a global or unique identifier,
allowing these records to be identified, linked, and
related across tables. Unfortunately, this is not the case
in real-life, complex databases. Many organizations have
multiple data collection systems (e.g. Oracle, legacy
systems), and these may differ not only in values or
identifiers, but also in format, structure, and schema of
databases. Additionally, data quality is affected by
human error, such as data entry errors, and lack of
constraints.
When data is entered manually or gathered from
different sources, whether from different systems or
different locations, duplicate records may result.
Duplicate records are defined as “all cases of multiple
representations of the same real-world objects, i.e.,
duplicates in a data source”. Heterogeneous data often
lacks a global identifier, or a primary key, which would
uniquely identify real-world objects.
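To illustrate the problem, the sketch below compares two hypothetical records that refer to the same person but share no unique identifier. The record values and the character-based similarity measure (Python's standard `difflib.SequenceMatcher`) are illustrative assumptions, not a method from this paper:

```python
from difflib import SequenceMatcher

# Two hypothetical records for the same real-world person, stored with
# different formatting and no shared primary key.
record_a = {"name": "John A. Smith", "city": "New York"}
record_b = {"name": "Smith, John", "city": "NYC"}

def name_similarity(a: str, b: str) -> float:
    """Character-match ratio between two normalized name strings."""
    def norm(s: str) -> str:
        # Lowercase, drop punctuation, and sort tokens so word order
        # ("Smith, John" vs "John Smith") does not matter.
        tokens = s.lower().replace(",", " ").replace(".", " ").split()
        return "".join(sorted(tokens))
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

score = name_similarity(record_a["name"], record_b["name"])
print(score > 0.9)  # the two names are highly similar despite formatting
```

Without normalization and approximate matching, an exact equality test would treat these two records as distinct entities.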
II. METHODS AND MATERIAL
A. Data Preparation
Duplicate record detection is the process of identifying
different or multiple records that refer to one unique
real-world entity or object. Typically, the process of
duplicate detection is preceded by a data preparation
stage, during which data entries are stored in a uniform
manner in the database, resolving (at least partially) the
structural heterogeneity problem.
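A minimal data-preparation step can be sketched as follows. The normalization rules and the abbreviation table here are assumptions chosen for illustration, not the paper's exact procedure:

```python
import re

# Hypothetical abbreviation table used to standardize field values.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}

def prepare(value: str) -> str:
    """Store an entry in a uniform manner: lowercase, strip punctuation,
    collapse whitespace, and expand known abbreviations."""
    value = value.lower().strip()
    value = re.sub(r"[^\w\s]", " ", value)  # replace punctuation with spaces
    tokens = [ABBREVIATIONS.get(t, t) for t in value.split()]
    return " ".join(tokens)

print(prepare("123 Main St."))      # "123 main street"
print(prepare("123  MAIN Street"))  # "123 main street"
```

After this stage, the two differently formatted addresses map to the same uniform representation, which reduces the structural heterogeneity that later matching steps must handle.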
The data preparation stage includes the following steps.
i. Parsing
It locates, identifies and isolates individual data elements
in the source files. Parsing makes it easier to correct,