Clustering of Web Search Results using Suffix Tree
Algorithm and Avoidance of Repetition of same
Images in Search Results using L-Point Comparison
Algorithm
Manne Suneetha
Assistant Professor, Department of Information Technology
Velagapudi Ramakrishna Siddhartha Engineering College
Vijayawada, Andhra Pradesh, India
manne_suni@vrsiddhartha.ac.in
Dr. S Sameen Fatima
Professor, Dept. of Computer Science and Engg.
University College of Engineering, Osmania University
Hyderabad, Andhra Pradesh, India
sameenf@gmail.com
Shaik Mohd. Zaheer Pervez
4/4 B.Tech., Department of Information Technology,
Velagapudi Ramakrishna Siddhartha Engineering College,
Vijayawada, Andhra Pradesh
zaheerimpeccable@gmail.com
Abstract—Users of existing search engines such as Google, Yahoo,
MSN, and Ask commonly find that an entered query returns a long
ranked list of results (snippets). It becomes cumbersome for the
user to go through each title, snippet, and sometimes even the link
of each search result until results relevant to the query are found.
Clustering of search results is a data mining technique by which
the retrieved results are organized into meaningful groups,
easing the user's work. This paper deals with the
generalized suffix tree based clustering approach. The most
repeated phrase in the document tags is taken as the cluster
name. In short, web search results fetched from the prevailing
web search engines are grouped under phrases that contain one
or more search keywords. This paper aims at organizing web
search results into clusters, giving the user quick browsing
options and a clear interface to the results. Suffix tree
clustering produces comparatively more accurate and informative
grouped results. A basic problem during image searching in any
search engine is image repetition. The L-Point Comparison
algorithm, a technique specially worked out in the field of
information retrieval systems to avoid this repetition, is also
discussed with a practical example.
Keywords- Coherent clustering, Cleaning of Document, Suffix
Tree Based Clustering (STBC), L-point image Comparison (LPC),
Shared phrase
I. INTRODUCTION
The Internet is undoubtedly the fastest and easiest means of
access to virtually unlimited information resources. But this
very abundance hampers efficient access to information. It is
aptly said that the Internet is an unorganized, unstructured,
and decentralized place for accessing information [4]. As the
number of web pages grew into the billions, researchers realized
that maintaining web directories is particularly beneficial to
users who are not familiar with the topics and their relations.
Yahoo was the first service to
provide the most complex human-made directory of the Web
in the year 2001. However, while some entries show cross links
to related topics, they do not show the relations between
topics at the same level; rather, the topics are sorted
alphabetically or by popularity. Moreover, due to the rapidly
growing and unstable nature of the web, such directories
often point to outdated or even nonexistent documents.
Organizing large numbers of search results into groups of
similar data (as web directories do) is becoming one of the
most complex tasks for emerging Web applications.
It poses challenges not only in the field of data mining
but also in the areas of information retrieval systems and
data warehousing. Clustering refers to grouping similar
items of data with respect to related phrases. The search
result clustering mechanisms investigated so far face
serious difficulties such as screening cluster labels,
assessing cluster quality, and controlling overlapping
clusters. Which clustering algorithm is the best choice with
respect to low time complexity and multilingual clustering
features has been an intensely investigated question among
developers. Web
search result clustering based on the suffix tree clustering
(STC) algorithm is a promising approach for handling the long
list of snippets returned by search engines. The original STC
algorithm can often construct a long path in the suffix tree,
particularly when identical snippets are fed to it [5].
The modeling and analysis section throws light on the
structure and design aspects of the suffix tree while the
PROCEEDINGS OF ICETECT 2011
978-1-4244-7925-2/11/$26.00 ©2011 IEEE 1041