© 2015 IJSRSET | Volume 1 | Issue 5 | Print ISSN: 2395-1990 | Online ISSN: 2394-4099
Themed Section: Engineering and Technology
IJSRSET151563 | Received: 09 October 2015 | Accepted: 16 October 2015 | September-October 2015 [(1)5: 290-294]

Detecting Duplicate Records - A Case Study

T. Parimalam, R. Deepa, R. Nirmala Devi, P. Yamuna Devi
Department of Computer Science, Nandha Arts and Science College, Erode, Tamil Nadu, India

ABSTRACT

Databases play an important role in today's IT-based economy. Many industries and systems depend on the accuracy of databases to carry out their operations. The quality of the information stored in a database can therefore have significant cost implications for a system that relies on that information to function and conduct business. In the real world, entities often have two or more representations in databases. Duplicate detection is the process of identifying multiple representations of the same real-world entity. The purpose of this paper is to provide a thorough study of the different methods used for detecting duplicate records; it also discusses the different duplicate detection tools in detail.

Keywords: Database, Duplicate Detection, Records

I. INTRODUCTION

Data quality has become a key issue in computer-based management systems. Inadequate data causes serious operational difficulties as well as direct financial losses. Operational databases store information generated by business transactions, and management uses this information to support business decisions. Ensuring data accuracy is vital, as data is the cornerstone of a company's business operations. Beyond its serious implications for decision making, poor data quality may also affect customer satisfaction, resulting in unnecessary and possibly high costs to repair the damage it causes.
In an ideal situation, each data item would have a global or unique identifier, allowing records to be identified, linked, and related across tables. Unfortunately, this is not the case in real-life, complex databases. Many organizations have multiple data collection systems (e.g., Oracle, legacy systems), and these may differ not only in values or identifiers but also in the format, structure, and schema of their databases. Data quality is further affected by human error, such as data entry mistakes, and by a lack of constraints. When data is entered manually or gathered from different sources, whether from different systems or different locations, duplicate records may result. Duplicates have been described as "all cases of multiple representations of same real-world objects, i.e., duplicates in a data source". Heterogeneous data often lacks a global identifier, or primary key, that would uniquely identify real-world objects.

II. METHODS AND MATERIAL

A. Data Preparation

Duplicate record detection is the process of identifying different or multiple records that refer to one unique real-world entity or object. Typically, duplicate detection is preceded by a data preparation stage, during which data entries are stored in a uniform manner in the database, resolving (at least partially) the structural heterogeneity problem. The data preparation stage includes the following steps.

i. Parsing

Parsing locates, identifies, and isolates individual data elements in the source files. Parsing makes it easier to correct,
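As an illustration of the parsing step (not taken from the paper), a free-form name string can be split into isolated data elements before correction and matching. The field layout and the set of recognized titles below are assumptions for the sketch:

```python
def parse_name(raw: str) -> dict:
    """Isolate the individual elements of a free-form name string.

    Splitting a raw value such as "Dr. John A. Smith" into title, first,
    middle, and last name makes later standardization and matching
    easier. The recognized titles and field names are illustrative
    assumptions, not a standard.
    """
    titles = {"mr", "mrs", "ms", "dr", "prof"}
    # Treat periods as separators so "Dr." and "A." tokenize cleanly.
    tokens = raw.replace(".", " ").split()
    result = {"title": None, "first": None, "middle": None, "last": None}
    if tokens and tokens[0].lower() in titles:
        result["title"] = tokens.pop(0)
    if tokens:
        result["first"] = tokens.pop(0)
    if tokens:
        result["last"] = tokens.pop()  # last remaining token is the surname
    if tokens:
        result["middle"] = " ".join(tokens)  # anything left is middle name(s)
    return result


# Example: parse_name("Dr. John A. Smith")
# -> {'title': 'Dr', 'first': 'John', 'middle': 'A', 'last': 'Smith'}
```

With the elements isolated in this way, two records such as "Dr. John A. Smith" and "Smith, John" can be compared field by field rather than as opaque strings.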