Intelligent Crawler Engine on Cloud Computing Infrastructure

Pratibha Ganapati Gaonkar
M. Tech (Dept. of CS&E)
Bapuji Institute of Engineering and Technology
Davangere, India
prathibha.kwr@gmail.com

Dr. Nirmala C R
Professor and HOD (Dept. of CS&E)
Bapuji Institute of Engineering and Technology
Davangere, India
crn@bietdvg.edu

Abstract—This paper implements an intelligent crawler engine on a cloud computing infrastructure. The approach runs the crawler engine on virtual machines in the cloud. The use of Virtual Machines (VMs) in this architecture makes it easy to set up, maintain, or terminate an instance running a particular crawler engine as needed. On this infrastructure we have designed an intelligent crawler using the Naive Best-First algorithm and the R-Spam Rank algorithm, which our results and analysis show to be more efficient than earlier crawlers. To accomplish this task, the Amazon public cloud is used with its services S3 and EC2.

Keywords—cloud computing infrastructure; intelligent crawler engine; Elastic Compute Cloud (EC2); Simple Storage Service (S3)

I. INTRODUCTION

In the world of Web 2.0, the adage "content is king" remains a prevailing theme. With seemingly endless content available online, the "findability" of content becomes a key factor. Search engines are the primary tools people use to find information on the web. Searches are performed using keywords: when a user enters a keyword or phrase, the crawler engine finds matching web pages and shows a search engine results page (SERP) with recommended web pages listed and sorted by relevance. Though it used to be difficult to obtain diverse content, there are now seemingly endless options competing for an audience's attention. As a result, search engines have gained popularity by helping users quickly find and filter the information they want. Google, Yahoo, Bing, and Ask have emerged as the most popular search engines in the recent past. Most users have formed searching habits to find the information they need, as no single website caters to all their needs. Google logs an estimated 2 billion searches per day, and an estimated 300 million users use its search facility daily[1].

Web crawlers are programs that use the graph structure of the Web to move from page to page[2]. Such programs are also called wanderers, robots, spiders, and worms. Web crawlers are designed to retrieve Web pages and add them, or representations of them, to a local repository/database. They are mainly used to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to support fast searches. Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. These pages are retrieved by a Web crawler, an automated Web browser that follows every link it sees. A crawler for a large search engine has to address two issues. First, it must have a good crawling strategy, i.e., a strategy for deciding which pages to download next; a minimal sketch of such a crawl loop is given below. Second, it needs a highly optimized system architecture that can download a large number of pages per second while being robust against crashes, manageable, and considerate of resources and web servers[3][4].
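The following sketch illustrates the kind of best-first crawl loop described above: candidate URLs wait in a priority queue ordered by a relevance score, and the most promising URL is fetched next. It is a minimal illustration using only the Python standard library, not the paper's implementation; in particular, the relevance() function is a placeholder assumption invented for illustration (a Naive Best-First crawler would normally score fetched page content against a topic).

import heapq
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute link targets from anchor tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def relevance(url, topic_keywords):
    # Placeholder scoring: count topic keywords appearing in the URL.
    # A real best-first crawler would score the fetched page content.
    return sum(1 for kw in topic_keywords if kw in url.lower())

def best_first_crawl(seed_urls, topic_keywords, max_pages=50):
    # Max-heap via negated scores: the highest-scoring URL pops first.
    frontier = [(-relevance(u, topic_keywords), u) for u in seed_urls]
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # skip unreachable or malformed URLs
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in visited:
                heapq.heappush(frontier,
                               (-relevance(link, topic_keywords), link))
    return visited

# Example call: best_first_crawl(["https://example.com/"], ["crawler", "cloud"])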
Cloud computing is considered a new computing paradigm, and it has recently gained a lot of attention due to its usefulness in terms of cost, adaptability, variety of services, and the computational support it offers to devices with little computational power[3]. Some people worry about security in the cloud, but in fact the cloud is more secure than proprietary infrastructure, because in cloud computing we outsource the computation, not the control[4]. Cloud computing can be an attractive option for an enterprise, especially for new enterprises that want to reduce the upfront cost of their computing infrastructure. Even established organizations can reduce not only the cost of the computing infrastructure itself but also the administrative and operational cost of running it: after purchasing computing infrastructure, an organization needs human resources, space, energy, and many other resources to manage and administer it, whereas opting for cloud computing services reduces these costs[5]. Some cloud infrastructure/service providers are Amazon[9], Salesforce, Google App Engine, and Microsoft Azure. The major users of these cloud providers are enterprises. In the near future, cloud services will be widely used by both enterprises and individuals, using hybrid computing and communication devices. It is therefore necessary to provide cloud services to individuals at very low cost, which can be achieved by creating competition among cloud vendors and by reducing their infrastructure cost. For this purpose, we propose a cloud computing model (i.e., Virtual Cloud) to achieve this low-cost objective. The Virtual Cloud model is mainly aimed at reducing cost for both cloud users and cloud vendors. Cloud computing is claimed to provide better efficiency in the use of infrastructure[6]. In addition, cloud computing technology has developed so rapidly that it could change the implementation or
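As the abstract states, this work runs crawler engines on EC2 virtual machines and uses S3 for storage. The sketch below shows how such a VM might be launched, pointed at a crawler startup script, and later terminated, here using the boto3 AWS SDK for Python (the paper does not specify its tooling). This is only an illustration of the pattern, not the paper's deployment code: the region, AMI ID, key pair, bucket name, and script paths are all placeholder assumptions.

import boto3

# Placeholders throughout: region, AMI, key pair, bucket, and paths
# are assumptions for illustration only.
ec2 = boto3.resource("ec2", region_name="us-east-1")

# User data runs at first boot and starts the crawler engine.
startup_script = """#!/bin/bash
python3 /opt/crawler/run_crawler.py --seeds /opt/crawler/seeds.txt
"""

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical crawler-engine image
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="crawler-key",            # hypothetical key pair
    UserData=startup_script,
)
instance = instances[0]
instance.wait_until_running()
print("Crawler VM running:", instance.id)

# On the VM itself, crawled pages could be archived to S3, e.g.:
#   boto3.client("s3").upload_file("crawled_pages.tar.gz",
#                                  "crawler-results-bucket",
#                                  "runs/crawled_pages.tar.gz")

# Once the crawl is finished, terminating the VM stops further cost.
instance.terminate()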