www.ijcrt.org © 2018 IJCRT | Volume 6, Issue 2 April 2018 | ISSN: 2320-2882
IJCRT1892047 International Journal of Creative Research Thoughts (IJCRT) www.ijcrt.org 306
GSA: A GLOBAL FRAMEWORK FOR
SIMILARITY SEARCHING
Anima Srivastava¹, Manish Jaiswal², Arpita Tewari³
¹Department of Electronics & Communication, University of Allahabad, Allahabad, India
²Department of Electronics & Communication, University of Allahabad, Allahabad, India
³Department of Electronics & Communication, University of Allahabad, Allahabad, India
Abstract : With advances in technology, searching combined with machine learning is believed to be a good technique for
measuring document similarity and prediction accuracy in plagiarism detection. The most popular searching algorithm in
both the industrial and the academic environment is the RankBrain algorithm. This paper proposes an improved framework
for searching with machine learning that manages the complexity of finding accurate matches. An empirical evaluation of
the proposed approach is given based on its objective and a case study. We describe a novel functional framework based on
a searching algorithm with machine learning, both for differentiating the intent of a query and for generating content
semantically. We explore and analyze various well-known Google searching algorithms in terms of their effectiveness for
similarity searching and best matching.
Index Terms - searching, document similarity, RankBrain
I. INTRODUCTION
Machine learning algorithms [16] are among the most powerful techniques for measuring the similarity of documents
by a variety of methods. This paper surveys the behavior of different similarity-based machine learning algorithms
but emphasizes RankBrain, i.e. the new way to design and find improved search ranking and quality. This
work presents a transparent comparative study of similarity detection, analyzing the strengths and deficiencies of
each approach. The rest of the paper is organized as follows: Section 1 contains the
introductory explanation of the work; Section 2 briefly describes several prominent
contemporary searching algorithms; Section 3 presents the literature review of related searching
[11] aspects and algorithms; Section 4 details the proposed framework; Section 5 measures the
efficiency and applicability of the proposed framework; finally, Section 6 gives the conclusion.
II. SEARCHING ALGORITHMS FOR SIMILARITY DETECTION
Each searching algorithm [6] has multiple parameters and searching criteria to detect similarity and retrieve optimum
outcomes. Some of the most popular Google searching algorithms [7] are discussed below:
2.1 Panda
Panda [7] is a searching algorithm that grades web pages based on the quality of their content and
lowers the rank of websites according to that quality. Panda works like a strainer (filter), unlike Google's other
searching algorithms. It is integrated into the ranking algorithm and is used to de-rank sites with low-quality
content. It does not operate in real-time search, but filtering and retrieving results with the updated version of Panda
is much faster than with the older one.
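
The filtering behaviour described above can be illustrated with a minimal sketch. It is not Google's Panda implementation: the `Page` fields, the `threshold` and `penalty` parameters, and the multiplicative demotion rule are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    relevance: float  # hypothetical base relevance score in [0, 1]
    quality: float    # hypothetical content-quality score in [0, 1]

def apply_quality_filter(pages, threshold=0.5, penalty=0.5):
    """Demote pages whose quality is below the threshold, then re-rank.

    The threshold and penalty values are illustrative assumptions.
    """
    scored = []
    for page in pages:
        if page.quality < threshold:
            score = page.relevance * penalty  # de-rank low-quality content
        else:
            score = page.relevance
        scored.append((score, page))
    # Highest adjusted score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page.url for _, page in scored]

pages = [
    Page("a.example", relevance=0.9, quality=0.2),
    Page("b.example", relevance=0.6, quality=0.9),
    Page("c.example", relevance=0.5, quality=0.8),
]
ranking = apply_quality_filter(pages)
```

In this sketch the most relevant page, `a.example`, falls to the bottom of the ranking because its quality score is below the assumed threshold.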
2.2 Pigeon
Pigeon is Google's searching algorithm released with two key factors, i.e. distance and location. Pigeon is
available for search results in English only. Results are ranked with respect to the searcher's location, which
significantly affects which local and non-local results are returned. It uses local directory sites to provide
better results. Google Maps and Google web search are used consistently by Pigeon to return relevant local search results.
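
To make the role of distance and location concrete, the following sketch re-ranks results by combining a relevance score with the great-circle distance to the searcher. It is an assumption-laden illustration, not Pigeon itself: the scoring formula, the `decay_km` parameter, and the result tuples are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def rank_local(results, user_lat, user_lon, decay_km=10.0):
    """Rank (name, relevance, lat, lon) tuples, discounting relevance by distance.

    The discount relevance / (1 + distance / decay_km) is an illustrative assumption.
    """
    def score(item):
        _, relevance, lat, lon = item
        distance = haversine_km(user_lat, user_lon, lat, lon)
        return relevance / (1.0 + distance / decay_km)
    return [item[0] for item in sorted(results, key=score, reverse=True)]

# Hypothetical searcher located in Allahabad; one nearby and one distant result.
results = [
    ("far_shop", 0.9, 28.61, 77.21),
    ("near_shop", 0.7, 25.46, 81.86),
]
local_ranking = rank_local(results, user_lat=25.45, user_lon=81.85)
```

Under these assumptions the nearby result outranks the distant one even though its raw relevance is lower.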
2.3 Penguin
The main objective of Penguin [7] is to detect and de-rank sites that build unsolicited, anomalous link profiles by
using devious tactics. It operates in real time, so correction and recovery take less time. Penguin is just one segment of
Google's main ranking algorithm.
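
One simple way to picture an anomalous link profile is a site whose inbound links overwhelmingly share the same anchor text. The check below is a hypothetical heuristic for illustration only; the signal used, the `max_share` and `min_links` parameters, and the function name are assumptions and are not taken from Penguin.

```python
from collections import Counter

def link_profile_is_anomalous(anchor_texts, max_share=0.6, min_links=5):
    """Flag a profile when one anchor text dominates the inbound links.

    anchor_texts: list of anchor strings from inbound links (hypothetical input).
    max_share and min_links are illustrative thresholds, not known values.
    """
    if len(anchor_texts) < min_links:
        return False  # too few links to judge
    _, top_count = Counter(anchor_texts).most_common(1)[0]
    return top_count / len(anchor_texts) > max_share

spammy = ["buy cheap watches"] * 8 + ["home", "brand name"]
natural = ["home", "brand name", "this article", "click here", "review"]
```

Here `link_profile_is_anomalous(spammy)` is true because one anchor accounts for 80% of the links, while the varied `natural` profile is not flagged.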
2.4 Pirate
Google's Pirate was introduced to inhibit and de-rank sites that have many copyright infringement reports.
Nowadays, popularly known sites are involved in distributing plagiarized content, e.g. video clips, songs, movies etc. for