www.theinternationaljournal.org > RJSITM: Volume: 05, Number: 08, June-2016

Classification of the Approaches of Near Duplicate Document Detection and Elimination

Kavita Garg, School of Computer and Information Sciences, MVN University, Palwal, India
Dr. Saba Hilal, http://sabahilal.blogspot.in/
Dr. Jay Shankar Prasad, School of Computer and Information Sciences, MVN University, Palwal, India

Abstract: The identification and removal of near duplicate documents is an important research area, because near duplicate pages add overhead to the web: they increase storage space and indexing cost, cause crawlers to return redundant results, and make searching ineffective. Identical pages exist on the web because it contains a huge volume of records. Much research has been done in this area, yet the problem persists. Researchers have studied it from different perspectives and formulated solutions; however, the problem intensifies as new pages are added to the web. This paper surveys previous research and classifies the algorithms and approaches, with the intention of structuring the area of duplicate document detection.

Keywords: Near duplicate document; URL normalization; Shingling.

1. Introduction

1.1 Identification and removal of near duplicate documents

The area of identification and removal of near duplicate documents comprises many approaches and algorithms, which are described below.

1.1.1 Hash index and page size: An approach to identifying near duplicates based on a hash index and page size is described in [1]. It accepts an input document, calculates its hash index and page size, and stores the calculated values in a hash table together with a pointer to the first data source. One drawback of this approach is that if the formatting or word order changes, two documents are not considered duplicates even though they contain the same content [1].
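The hash-table lookup described above can be sketched as follows. This is a minimal illustration, not the implementation from [1]: the choice of MD5 as the hash index and byte length as the page size are assumptions made here for concreteness.

```python
import hashlib

def fingerprint(content):
    """Compute a (hash index, page size) pair for a document.

    The hash index is taken here as an MD5 digest of the text and the
    page size as its length in bytes; reference [1] does not fix these
    choices, so they are illustrative assumptions.
    """
    data = content.encode("utf-8")
    return hashlib.md5(data).hexdigest(), len(data)

class DuplicateDetector:
    def __init__(self):
        # Hash table keyed by (hash index, page size); the value is a
        # pointer back to the first data source that produced the pair.
        self.table = {}

    def add(self, source_id, content):
        """Store the document; return the id of the first source if it
        is a duplicate, otherwise None."""
        key = fingerprint(content)
        if key in self.table:
            return self.table[key]  # matches an earlier data source
        self.table[key] = source_id
        return None
```

Note how the drawback from [1] shows up directly: reordering the words of a document changes its hash index, so the reordered copy is stored as a new entry even though its content is the same.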
1.1.2 Instance-level constrained clustering: Fingerprint approaches and word comparison alone are not sufficient for finding near duplicate documents. In [2], a new approach based on instance-level constrained clustering is proposed. Instance-level constrained clustering is a semi-supervised process: it forms near duplicate clusters using information such as document attributes and content structure, and incorporates constraints on both during clustering. The instance-level constraint, which is based on document attributes, is the main component of this process.

1.1.3 Fusion of algorithms:
o Fusion of "state of the art" algorithms
o An efficient approach using clustering, sentence features, and fingerprinting

Fusion of "state of the art" algorithms: In [3], three algorithms are merged to give better results in identifying near duplicate documents. The approach fuses the shingling algorithm, I-Match, and simhash. First, it takes sequences of words (shingles) from the document; it then feeds the document's fingerprints into a shingling-based simhash algorithm.
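The shingling-plus-simhash pipeline described in [3] can be sketched as below. This is only an illustration of the two fused components, under assumptions made here: a shingle size of k = 3, a 64-bit simhash built from MD5 hashes of each shingle, and the I-Match step omitted. Near duplicate documents then yield fingerprints with a small Hamming distance.

```python
import hashlib

def shingles(text, k=3):
    """Contiguous k-word shingles of the document (k=3 is an assumption)."""
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))]

def simhash(features, bits=64):
    """Simhash: each feature votes +1/-1 on every bit of its hash;
    the fingerprint keeps the sign of each bit's total."""
    votes = [0] * bits
    for f in features:
        h = int(hashlib.md5(f.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two documents that share most of their shingles vote the same way on most bits, so their fingerprints differ in only a few positions, while unrelated documents differ in roughly half of the bits.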