ISSN 2320-5407 International Journal of Advanced Research (2016), Volume 4, Issue 2, 1224-1232
Journal homepage: http://www.journalijar.com
INTERNATIONAL JOURNAL OF ADVANCED RESEARCH

RESEARCH ARTICLE
Automation of Resolving CAPTCHAs for Web Crawling

Renuka Sakhare 1, Abhay Bhagat 2 and Anil Bhadagle 3
1. Student, PVG's College of Engineering, Pune.
2. Student, PVG's College of Engineering, Pune.
3. Assistant Professor, Department of Computer Engineering, PVG's College of Engineering, Pune.

Manuscript History: Received: 14 December 2015; Final Accepted: 26 January 2016; Published Online: February 2016
Key words: Web crawling, CAPTCHAs, text area detection, image segmentation, character recognition.
*Corresponding Author: Renuka Sakhare

Abstract: A web crawler is an automated program used by search engines to collect web-page data from the World Wide Web, a process called web crawling. To keep its data current, a crawler must re-fetch and cache pages frequently. This frequent, deep retrieval degrades web-server performance far more than human browsing does, so to balance load and for authentication, servers ask crawlers to verify themselves against CAPTCHAs. With more than two billion pages on the web, it is not feasible for humans to solve and enter CAPTCHAs on the crawler's behalf. To automate CAPTCHA solving, we describe a system for recognizing text in CAPTCHA images. Our particular focus is reliable text extraction and recognition, and feeding the resolved CAPTCHA characters back to the crawler so that crawling can continue without human involvement.

Copyright, IJAR, 2016. All rights reserved.

Introduction:-
The World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages at present [1]. Due to the abundance of data on the web and differing user perspectives, information retrieval has become a challenge.
When data is searched, hundreds or thousands of results appear. For the user's convenience, search engines therefore have the larger job of sorting the results so that those most relevant to the user appear on the first page, each with a quick summary of the information on that page. Web crawlers are programs that browse the WWW in a methodical, automated way. A web crawler creates a copy of every visited page for later processing by a search engine. This process is iterative, continuing as long as the results remain in close proximity to the user's interest. Search engines use algorithms that sort and rank the results in order of authority with respect to the user's query. Many algorithms are in use: breadth-first search, best-first search, the PageRank algorithm, genetic algorithms, and Naïve Bayes classification, to mention a few [6].

Important characteristics of the Web, such as its large volume and dynamic page generation, make crawling very difficult. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added, or that existing pages have been updated or even deleted. The performance of a web crawler is measured by freshness and age: a page is considered "fresh" when the local copy is identical to the remote source. Cho and Garcia-Molina [6] calculate the freshness of a page as shown in Figure 1.

Figure 1: Freshness of a page (Cho and Garcia-Molina [6]).
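The freshness metric referenced above can be sketched in a few lines: a page's freshness F(p; t) is 1 when the local copy matches the remote source at time t and 0 otherwise, and the freshness of a collection is the average over its pages. The sketch below is a minimal illustration under that definition; the `Page` class and the idea of comparing version hashes (e.g. content digests or ETags) are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    local_version: str   # version identifier of the cached copy (e.g. a content hash)
    remote_version: str  # version identifier of the live page at observation time

def is_fresh(page: Page) -> int:
    """F(p; t): 1 if the cached copy matches the live page, else 0."""
    return 1 if page.local_version == page.remote_version else 0

def collection_freshness(pages: list[Page]) -> float:
    """F(S; t): average freshness over all pages in the local collection."""
    if not pages:
        return 0.0
    return sum(is_fresh(p) for p in pages) / len(pages)

# Example: three cached pages, two of which still match the live site.
pages = [
    Page("http://a.example", "v1", "v1"),
    Page("http://b.example", "v2", "v3"),  # stale: the remote page changed
    Page("http://c.example", "v5", "v5"),
]
print(collection_freshness(pages))  # 2 of 3 pages fresh
```

A crawler that re-fetches pages more often raises this average at the cost of heavier server load, which is exactly the tension that leads servers to challenge crawlers with CAPTCHAs.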