International Journal of Computer Applications (0975-8887), Volume 160, No. 1, February 2017

A Novel SSPS Framework for String Similarity Join

P. Selvaramalakshmi, Research Scholar, Department of Computer Science, Bishop Heber College, Trichy, TamilNadu, India.
S. Hari Ganesh, PhD, Assistant Professor, Department of Computer Science, H. H. The Rajah's College, Pudukottai, TamilNadu, India.
Florence Tushabe, PhD, Associate Professor, UTAMU, Kampala, Uganda, East Africa.

ABSTRACT
The enormous growth of information challenges existing string analysis techniques for processing large volumes of data, which leaves room for new approaches. Moreover, the problems encountered with traditional methods, such as low pruning power, increased false positives and poor scalability, need to be addressed. Hence, this paper proposes an improved similarity join using the SSPS MapReduce framework, which consists of a novel PSS stemming algorithm along with three newly proposed filtering techniques, namely SSize, SPositional and UI (Union Intersection), designed to process large-scale data effectively while addressing the limitations of traditional filtering methods. The experiments show that the framework reduces false positives and run-time cost and scales better than existing frameworks.

Keywords
Similarity joins, Hadoop, MapReduce, filtering and verification

1. INTRODUCTION
Similarity join is one of the vital tasks in data cleansing and integration; it is intended to find similar pairs of strings from two sets or collections of documents. It therefore has a wide range of applications, including duplicate detection [1] [2] [3] [4], data cleaning [5] [6], plagiarism detection [7], record linkage [8] and string searching [9] [10].
The traditional methods of string similarity employ a well-known filter-verification framework that comprises two essential steps, filtering and verification. The filter extracts candidate pairs by pruning the large number of dissimilar pairs, and the verification determines the actual similarity of the documents by thoroughly evaluating each candidate pair. The filter requires particular care because it plays a vital role in the framework. String similarity is typically classified by character-based or token-based metrics [11]. As the intention of this research work is to propose several filtering approaches, the token-based filtering approaches have been studied. A token-based metric first converts the strings into token sets and then applies a set-based similarity measure, such as Jaccard or cosine similarity, to quantify the similarity [12]. The filtering techniques are also classified according to the types of similarity measures. The state of the art in string similarity join lies in the effective modification of filtering techniques w.r.t. similarity metrics, which is the motivating factor of this research. Hence, the following section presents the recently proposed filtering techniques and their merits and demerits.

The remaining sections of the paper are organized as follows: Section 2 deals with the recent literature on filtering techniques, Section 3 discusses the SSPS framework and the research contributions of the paper, Section 4 describes the experimentation and discusses the results and, finally, Section 5 concludes the findings of the paper.

2. LITERATURE REVIEW ON FILTERING TECHNIQUES
2.1 Count Filtering (CF)
The basic notion of CF is that two strings can be similar only if they share at least C common signatures, which implies that any string pair sharing fewer than C signatures can be pruned.
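The filter-verification idea with a count filter can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the function names, the whitespace tokenization, the parameters `min_common` (standing in for C) and `threshold` (standing in for the similarity threshold), and the brute-force enumeration of pairs in place of inverted lists are all assumptions made for illustration.

```python
from itertools import combinations


def tokenize(s):
    """Split a string into a set of lowercase whitespace-separated tokens."""
    return set(s.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def count_filter_join(strings, min_common, threshold):
    """Hypothetical helper: filter step followed by verification step.

    Filter: keep only pairs sharing at least `min_common` tokens.
    Verify: keep only candidates whose Jaccard similarity >= `threshold`.
    Returns (candidates, results) as lists of index pairs.
    """
    tokens = [tokenize(s) for s in strings]

    # Filter step: prune pairs with too few common tokens (signatures).
    candidates = [
        (i, k)
        for i, k in combinations(range(len(strings)), 2)
        if len(tokens[i] & tokens[k]) >= min_common
    ]

    # Verification step: evaluate the actual similarity of each candidate.
    results = [
        (i, k)
        for i, k in candidates
        if jaccard(tokens[i], tokens[k]) >= threshold
    ]
    return candidates, results


docs = [
    "data cleaning and integration",
    "data cleaning and linkage",
    "plagiarism detection",
    "data cleaning tools for big pipelines today",
]
candidates, results = count_filter_join(docs, min_common=2, threshold=0.5)
print(candidates)
print(results)
```

In this toy input, the pairs involving the last string pass the count filter but fail verification, which is the kind of false positive that a more selective filter would remove before the costlier verification step.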
The method takes each token as a signature and sets an overlap threshold C on the number of common signatures. The condition that two strings 'j' and 'm' are similar w.r.t. the overlap similarity can be denoted using the equation

C = |j∩m| / (|j| + |m| - |j∩m|) ≥ γ (1)

If the length of the signature is increased, fewer strings share a common signature, causing the inverted lists to be shorter, which may decrease the time taken to merge the inverted lists. In contrast, a lower threshold on the number of common signatures shared by similar strings makes the count filter less selective in eliminating dissimilar string pairs [13]. The number of false positives after merging the lists then increases, and more time is needed to compute their common signatures in order to verify whether they belong to the answer to the query.

2.2 Length Filtering (LF)
The length of a string may also be considered as one of the joining constraints, since similar strings tend to have similar lengths. Thus, LF prunes dissimilar pairs w.r.t. length difference: if two strings are similar, then their lengths cannot differ by more than a bound set by the threshold γ [14]. The condition that two strings 'j' and 'm' are similar w.r.t. LF can be denoted using the equation

γ|j| ≤ |m| ≤ |j|/γ (2)

LF is attained by partitioning the strings into groups of strings of the same length. Two groups are pruned against each other when their string lengths are too dissimilar. LF increases the join cost and the number of false positives, which results in low pruning power and affects scalability.

2.3 Prefix Filtering (PF)
PF sorts the tokens into an ordered sequence, such as alphabetical order or inverse document frequency order, and compares the first set of prefix signatures of the strings, based on the fact that if two strings 'j' and 'm' are similar, then the prefixes of their ordered sequences are also similar [15]. Given the overlap threshold, for each string 'j' the PF 'jp' is calculated using the equation