Indonesian Journal of Electrical Engineering and Computer Science
Vol. 13, No. 2, February 2019, pp. 492~498
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v13.i2.pp492-498
Journal homepage: http://iaescore.com/journals/index.php/ijeecs

Focused crawling from the basic approach to context aware notification architecture

Venugopal Boppana, Sandhya P
School of Computing Science and Engineering, Vellore Institute of Technology, Chennai Campus, India

Article Info
Article history: Received Jul 7, 2018; Revised Oct 4, 2018; Accepted Nov 18, 2018
Keywords: Complex event processing; Focused crawler; Topic specific crawler

ABSTRACT
The sheer volume and variety of information on the web makes it difficult for crawlers and search engines to extract relevant information. This paper discusses focused crawlers, also called topic specific crawlers, and variations of focused crawlers leading to a distributed architecture, i.e., the context aware notification architecture. A focused crawler is used to retrieve the relevant pages from the huge amount of information available on the internet. It can return the relevant pages for a given topic with fewer searches and in a shorter time. The input to the focused crawler is a topic specified using exemplary documents rather than keywords. A focused crawler avoids searching all web documents; instead, it searches only the links that are relevant to the crawler boundary. The focused crawling mechanism saves CPU time to a large extent, which helps keep the crawl up to date.

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author: Venugopal Boppana, School of Computing Science and Engineering, Vellore Institute of Technology, Chennai Campus, Chennai, India. Email: srees.boppana@gmail.com

1. INTRODUCTION
Nowadays most of the latest information is available to us from the internet. The greatest challenge, however, is to get the relevant information for a given topic.
Searching the web can also lead to extracting irrelevant information. This type of extraction, i.e., extracting both relevant and irrelevant data, is done by the classical crawler. It wastes CPU time, memory and other resources to a large extent. The classical crawler follows a breadth first mechanism, which searches all the links of a single parent. These links may contain irrelevant data along with the relevant data.

To resolve the above challenges of time, space, resources and irrelevant data, topic specific crawlers, or focused crawlers, were designed and introduced. They are much better than the classical crawler at producing accurate data for a given topic. A topic specific crawler avoids searching the entire web and instead searches only a specific area of it. This crawler follows the mechanism of depth first search. The working of the focused crawler is divided into two steps. In the first step, irrelevant data is separated from the relevant data. In the second step, the seed page URL is selected, which helps in finding the next child nodes, i.e., the next links to relevant pages. The focused crawler reduces the time needed to crawl, the memory needed to store the crawled or visited pages, and the amount of irrelevant data. This is a great improvement over the classical crawler.

Focused crawlers are of two kinds: classical focused crawlers and learning focused crawlers. A classical focused crawler is given a predefined set of rules to pick the relevant pages for a given topic. A learning crawler updates which links it crawls by learning from a training set, and this training set is updated regularly.
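The contrast between the classical crawler and the two-step focused crawler described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory page graph (`TOY_WEB`) in place of real HTTP fetching, a simple bag-of-words cosine similarity against exemplary documents as the relevance test, and a hypothetical `threshold` of 0.2; none of these come from the paper. The frontier is a priority queue ordered by the parent page's score (best-first), which is one common way to follow promising links deeper and is swapped in here purely for illustration.

```python
import heapq
import math
import re
from collections import Counter, deque

# Hypothetical toy web: URL -> (page text, outgoing links). Not real data.
TOY_WEB = {
    "seed": ("web crawler search engine index", ["a", "b", "c"]),
    "a": ("focused crawler topic relevance web pages", ["d", "e"]),
    "b": ("football match score league", ["f"]),
    "c": ("cooking recipe pasta sauce", ["g"]),
    "d": ("topic specific crawler relevant links", []),
    "e": ("crawler seed url frontier web", []),
    "f": ("league table football transfer", []),
    "g": ("dessert recipe chocolate cake", []),
}


def vectorize(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def breadth_first_crawl(web, seed):
    """Classical crawler: fetch every reachable page, level by level."""
    visited, queue, fetched = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        fetched.append(url)
        for link in web[url][1]:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return fetched


def focused_crawl(web, seed, exemplars, threshold=0.2):
    """Focused crawler: expand links only from pages judged relevant."""
    topic = vectorize(" ".join(exemplars))
    frontier = [(-1.0, seed)]  # min-heap on negated score acts as a max-heap
    seen, relevant, fetched = {seed}, [], []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        fetched.append(url)
        # Step 1: separate relevant pages from irrelevant ones.
        score = cosine(topic, vectorize(text))
        if score < threshold:
            continue  # irrelevant page: its links are never followed
        relevant.append(url)
        # Step 2: queue child links, prioritized by the parent's score.
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant, fetched


# The topic is given as exemplary documents rather than keywords.
exemplars = [
    "focused crawler topic relevant web pages",
    "topic specific crawler links",
]
bfs_pages = breadth_first_crawl(TOY_WEB, "seed")
relevant_pages, focused_pages = focused_crawl(TOY_WEB, "seed", exemplars)
print("classical crawler fetched:", bfs_pages)
print("focused crawler fetched:", focused_pages)
print("focused crawler kept:", relevant_pages)
```

On this toy graph the classical crawler fetches all eight pages, while the focused crawler stops at the off-topic pages `b` and `c` and never fetches their children `f` and `g`. A real crawler would additionally need fetching, parsing, politeness rules, and a trained or rule-based relevance model in place of the cosine test.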