Indonesian Journal of Electrical Engineering and Computer Science
Vol. 13, No. 2, February 2019, pp. 492~498
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v13.i2.pp492-498
Journal homepage: http://iaescore.com/journals/index.php/ijeecs
Focused crawling from the basic approach to context aware
notification architecture
Venugopal Boppana, Sandhya P
School of Computing Science and Engineering, Vellore Institute of Technology, Chennai Campus, India
Article Info
Article history:
Received Jul 7, 2018
Revised Oct 4, 2018
Accepted Nov 18, 2018
ABSTRACT
The sheer volume and variety of information on the web makes it difficult
for crawlers and search engines to extract relevant information. This paper
discusses focused crawlers, also called topic-specific crawlers, and
variations of focused crawlers leading to a distributed architecture, i.e., the
context-aware notification architecture. A focused crawler is used to retrieve
relevant pages from the huge amount of information available on the internet.
It can retrieve the pages relevant to a given topic with fewer searches and
in a shorter time. The input to the focused crawler is a topic specified
using exemplary documents rather than keywords. Focused crawlers avoid
searching all web documents; instead, they search only the links that are
relevant to the crawl boundary. The focused crawling mechanism saves CPU
time to a large extent and helps keep the crawl up to date.
Keywords:
Complex event processing
Focused crawler
Topic specific crawler
Copyright © 2018 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Venugopal Boppana,
School of Computing Science and Engineering,
Vellore Institute of Technology, Chennai Campus, Chennai, India.
Email: srees.boppana@gmail.com
1. INTRODUCTION
Today, most up-to-date information is available on the internet. The greatest challenge, however, is
retrieving the information relevant to a given topic, since a search can also extract irrelevant
information from the web. This type of extraction, i.e., extracting both relevant and irrelevant data, is
performed by the classical crawler. It leads to considerable waste of CPU time, memory and resources. The
classical crawler follows a breadth-first mechanism, which visits all the links of a single parent. Those
links may contain irrelevant data along with the relevant data.
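The breadth-first behaviour described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory link graph (`TOY_WEB`) in place of real HTTP fetching, and the function name `breadth_first_crawl` is illustrative rather than taken from any existing crawler.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> outgoing links.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}


def breadth_first_crawl(graph, seed):
    """Visit every reachable page level by level, with no relevance check."""
    visited = [seed]      # pages in the order they were discovered
    seen = {seed}         # fast membership test to avoid revisiting
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        # Every link of the current parent is queued, relevant or not.
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append(link)
    return visited
```

Because nothing filters the queue, every page reachable from the seed is fetched, which is the source of the wasted CPU time and memory noted above.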
To address these challenges of time, space, resources and irrelevant data, topic-specific crawlers,
or focused crawlers, were designed and introduced. They are much better than the classical crawler at
producing accurate data for a given topic. A topic-specific crawler avoids searching the entire
web and instead searches only a specific area of it. This crawler follows a depth-first search mechanism.
The working of a focused crawler is divided into two steps. In the first step, irrelevant data is separated from
the relevant data; in the second step, the seed page URL is selected, which helps in finding the next child
nodes, i.e., the next links to relevant pages. The focused crawler reduces the time needed to crawl, the
memory needed to store crawled or visited pages, and the amount of irrelevant data retrieved. This is a great
improvement over the classical crawler.
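The two steps can be sketched as follows. This is only an illustration under stated assumptions: the topic is represented by a set of terms drawn from an exemplary document, relevance is a simple term-overlap score, and the names `relevance`, `focused_crawl`, `PAGES` and `LINKS` are hypothetical. A real focused crawler would use a trained classifier and fetch pages over the network.

```python
# Hypothetical page contents and link structure standing in for the web.
PAGES = {
    "seed": "web crawler topic search",
    "a": "focused crawler topic relevance",
    "b": "cooking recipes pasta",
    "c": "crawler frontier topic",
    "d": "football scores",
}
LINKS = {"seed": ["a", "b"], "a": ["c"], "b": ["d"]}


def relevance(text, exemplar_terms):
    """Fraction of exemplar terms found in the page text (a stand-in score)."""
    words = set(text.lower().split())
    return len(words & exemplar_terms) / len(exemplar_terms)


def focused_crawl(pages, links, seed, exemplar_terms, threshold=0.5):
    """Depth-first crawl that only expands pages judged relevant."""
    relevant, seen, stack = [], {seed}, [seed]
    while stack:
        url = stack.pop()
        # Step 1: separate relevant pages from irrelevant ones.
        if relevance(pages[url], exemplar_terms) < threshold:
            continue  # links of an irrelevant page are never followed
        relevant.append(url)
        # Step 2: the relevant page acts as the seed for its child links.
        for child in reversed(links.get(url, [])):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return relevant
```

With the exemplar terms `{"crawler", "topic"}`, page `b` is rejected in the first step, so its child `d` is never fetched at all. This is how pruning at the relevance check saves crawl time and storage.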
Focused crawlers fall into two sub-types: classical focused crawlers and learning focused crawlers.
A classical focused crawler is given a predefined set of rules for picking the pages relevant to the
given topic. A learning crawler updates the links it crawls by learning from a training set, which is
itself updated regularly.
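The difference between the two sub-types can be sketched as below. Both classes are hypothetical illustrations: the fixed rule is a simple keyword match, and the learning scorer uses a naive additive term weighting chosen for brevity, not a method described in this paper.

```python
from collections import Counter


class RuleBasedScorer:
    """Classical focused crawler: relevance rules are fixed in advance."""

    def __init__(self, topic_terms):
        self.topic_terms = set(topic_terms)

    def is_relevant(self, text):
        # A page is relevant if it contains at least one predefined topic term.
        return bool(self.topic_terms & set(text.lower().split()))


class LearningScorer:
    """Learning focused crawler: term weights are updated from a training set."""

    def __init__(self):
        self.weights = Counter()

    def update(self, training_set):
        # training_set: iterable of (text, label); label is True for relevant pages.
        for text, label in training_set:
            for word in set(text.lower().split()):
                self.weights[word] += 1 if label else -1

    def is_relevant(self, text):
        # Unseen words contribute a weight of 0.
        return sum(self.weights[w] for w in set(text.lower().split())) > 0
```

Calling `update` again with fresh labelled pages changes later decisions, whereas the rule-based scorer only changes when its term list is edited by hand.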