International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064
Index Copernicus Value (2013): 6.14 | Impact Factor (2014): 5.611
Volume 4 Issue 11, November 2015
www.ijsr.net
Licensed Under Creative Commons Attribution CC BY

Removing De-duplication Using Pattern Search Suffix Arrays

Pratiksha Dhande 1, Supriya Kumari 2, Sushmita Tupe 3, Laukik Shah 4

Department of Computer Science Engineering, Savitribai Phule Pune University, G.H.R.I.E.T, Wagholi, Pune, Maharashtra, India

Abstract: With the growth of duplicate entries in data sets such as voter-card or PAN-card records, removing duplicates has become a major challenge. Record linkage is the process of matching records from several databases that refer to the same entities; when applied to a single database, this process is known as de-duplication. This paper investigates how to remove duplicates with the help of suffix arrays. The suffix array is a well-organized data structure for pattern searching. The paper covers similarity metrics that are commonly used to detect similar field entries, and presents a broad set of duplicate detection algorithms that can identify approximately duplicate records in a database. It also covers multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. Finally, based on these algorithms, the paper presents how to remove duplicates from a data set.

Keywords: String search, pattern matching, suffix array, suffix tree

1. Introduction

Databases play an important role in today's world. Many different sectors depend on the correctness of databases to carry out their operations. Hence, the quality of the data stored in databases can have major implications for the systems built on them. An essential step in integrating data from different sources is to identify and eliminate duplicate records that refer to the same entity; this process is known as de-duplication.

String search is a well-known problem: given a text A[0 . . . m−1] over some alphabet Σ of size s = |Σ| and a pattern Q[0 . . . k−1], locate the occurrences of Q in A. Several different query modes are possible: whether or not Q occurs (existence queries); how many times Q occurs (count queries); the byte locations in A at which Q occurs (locate queries); and a set of extracted contexts of A that include each occurrence of Q (context queries) [1]. When A and Q are provided on a one-off basis, sequential pattern search methods take O(m + k) time. When A is fixed and many patterns are to be processed, it is usually more efficient to pre-process A and construct an index. The suffix array is one such index, allowing locate queries to be answered in O(k + log m + y) time when there are y occurrences of Q in A, using O(m log m) bits of space in addition to A. However, suffix arrays only provide efficient querying if A plus the index require less main memory than is available on the host computer, because multiple accesses are made to both. For large texts, two-tier structures are needed, with an in-memory component consulted first in order to identify the data that must be retrieved from an on-disk index.

As many businesses, government agencies and research projects collect increasingly large amounts of data, techniques that allow efficient processing, analysing and mining of such massive databases have in recent years attracted interest from both academia and industry [2]. One task that has been recognized to be of increasing importance in many application domains is the matching of records that relate to the same entities across several databases. Often, information from multiple sources needs to be integrated and combined in order to improve data quality, or to enrich data to facilitate more detailed data analysis. The records to be matched frequently correspond to entities that refer to people, such as clients or customers, patients, employees, taxpayers, students, or travellers.

2.
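The locate query described above can be sketched in Python. This is a minimal, non-optimised illustration (not the paper's implementation): the suffix array is built by naively sorting suffix start positions, and a pattern is located by binary-searching the sorted suffixes for the range whose prefixes equal the pattern, giving the O(k log m + y)-style behaviour discussed in the text.

```python
def build_suffix_array(text):
    """Sort all suffix start positions of `text` lexicographically.
    Naive O(m^2 log m) construction; production code would use an
    O(m log m) or O(m) algorithm."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def locate(text, sa, pattern):
    """Return all positions of `pattern` in `text` via binary search
    over the suffix array."""
    k = len(pattern)
    # Lower bound: first suffix whose k-prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose k-prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] == pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)   # [5, 3, 1, 0, 4, 2]
print(locate(text, sa, "ana"))  # -> [1, 3]
```

Because all occurrences of the pattern prefix suffixes that are adjacent in sorted order, the answer is always a contiguous run of the suffix array, which is what makes the logarithmic search possible.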
Literature Survey

De-duplication is necessary for the construction of web portals which combine data from different pages, possibly created in a distributed manner by millions of people. The key challenge in this task is to find a function that can determine when two records refer to the same entity in spite of errors and inconsistencies in the data. One task that has been recognized to be of growing importance in many application domains is the matching of records that refer to the same entities across numerous databases. Many businesses practise de-duplication and record linkage techniques with the objective of de-duplicating their databases to increase data quality or compile mailing lists, or of matching their data across organisations, for example for collaborative marketing and e-commerce projects. Various government organisations now increasingly employ record linkage, for instance within and amongst taxation offices and departments of social security, to identify people who register for assistance multiple times, or who work while collecting unemployment benefits. It is at present not clear which indexing technique is appropriate for which type of data and which kind of record linkage or de-duplication application.

Suffix-array indexing has recently been proposed as an efficient, domain-independent approach to multi-source information combination. The basic idea is to insert the blocking key values (BKVs) and their suffixes into a suffix-array-based inverted index. A suffix array holds strings or sequences and their suffixes in alphabetically sorted order. Indexing based on suffix arrays has been used effectively for both English and Japanese databases. One of the best-studied special cases of approximate matching is edit distance, which permits deleting, inserting and replacing characters (replacing a character by a different one) in both strings. If the different operations have different costs, or the costs depend on the characters involved, the measure is referred to as general edit distance.

Paper ID: NOV151363
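The suffix-array blocking idea described above can be sketched as follows; the record ids, field values and the minimum-suffix-length parameter are invented for illustration, not taken from the paper. Each record's blocking key value and all of its suffixes down to a minimum length are inserted into an inverted index, so records sharing any sufficiently long suffix land in the same block and become candidate duplicate pairs.

```python
from collections import defaultdict
from itertools import combinations

MIN_SUFFIX_LEN = 3  # hypothetical threshold: shorter suffixes are ignored

def suffixes(bkv, min_len=MIN_SUFFIX_LEN):
    """All suffixes of the blocking key value with length >= min_len."""
    return [bkv[i:] for i in range(len(bkv) - min_len + 1)]

def build_inverted_index(records):
    """Map each suffix to the set of record ids whose BKV contains it."""
    index = defaultdict(set)
    for rec_id, bkv in records.items():
        for s in suffixes(bkv):
            index[s].add(rec_id)
    return index

def candidate_pairs(index):
    """Records sharing any indexed suffix become candidate duplicate pairs."""
    pairs = set()
    for rec_ids in index.values():
        for a, b in combinations(sorted(rec_ids), 2):
            pairs.add((a, b))
    return pairs

# Hypothetical records keyed by id; the BKV here is a given-name field.
records = {1: "catherine", 2: "katherine", 3: "kylie"}
index = build_inverted_index(records)
print(sorted(candidate_pairs(index)))  # -> [(1, 2)]
```

Records 1 and 2 share long suffixes such as "atherine", so they fall into common blocks and are compared in detail; record 3 shares no suffix of length 3 or more with either, so no comparison with it is generated.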
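The edit distance mentioned above, with unit cost for each insertion, deletion and substitution (the Levenshtein distance), can be computed with the standard dynamic-programming recurrence; a minimal sketch:

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning s into t."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the prefixes s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all i characters of s[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match/substitute
    return d[m][n]

print(edit_distance("catherine", "katherine"))  # -> 1
```

The general edit distance described in the text is obtained by replacing the unit costs above with operation-specific (or character-pair-specific) weights.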