International Journal of Recent Technology and Engineering (IJRTE)
ISSN: 2277-3878, Volume-9 Issue-1, May 2020
Retrieval Number: A1463059120/2020©BEIESP
DOI:10.35940/ijrte.A1463.059120
IDENTIFICATION OF WEB SITE RELIABILITY
THROUGH DATA SCRAPING AT WEB
CRAWLER'S NAVIGATION
S. Ponmaniraj, Tapas Kumar, Amit Kumar Goel
Abstract: Searching for specific content on the web is like looking for a single character in a bundle of pages. When a user enters a keyword into a search engine, the engine passes it to a web mining process that collects every term related to that key phrase. A few of the returned pages give the legal, authenticated material the user actually wants to access, whereas many other pages carry unwanted and malicious code or virus-laden pages that harm the user's activities and the system's functions. An attack in which a web page targets a system with faulty instructions and malevolent programs through some form of intrusion methodology is called phishing. In this attack, the user is led to unknown or illegal sites through unidentified website links embedded among legal site content. Once the victim's system performance is compromised, the hacker begins the attack. To avoid this kind of abuse, the user needs to understand the reliability of a web page's contents before continuing to browse. This paper presents a web crawler architecture, its design complexities, and an implementation for scraping content from visited web pages to identify their reliability and freshness.
Keywords: Intrusion Detection System, Parser, Scanner,
Search Engine Optimization, Semantic Web, Unstructured
Information Management Architecture, Web crawler, Web
Robot.
I. INTRODUCTION
In this internet era, every consumer can access millions of web pages for a single query passed to a search engine. Search engines use many optimization techniques to bring the exact content to users, yet by analyzing only the keyword phrase they still produce an enormous number of result pages. For example, if a client machine passes the term "Crawler" to the Google search engine, it returns more than 2,57,00,00,000 sites containing the searched content. The same keyword passed to the Yahoo search engine brings back more than 19,400,000 web sites, and when it is searched on the DuckDuckGo search engine, the result count runs to about 34 billion web sites.
Revised Manuscript Received on April 15, 2020
S. Ponmaniraj*, Research Scholar, School of Computing Science and Engineering, Galgotias University, Uttar Pradesh, India, ponmaniraj@gmail.com
Dr. Tapas Kumar, Professor, School of Computing Science and Engineering, Galgotias University, Uttar Pradesh, India, tapas.kumar@galgotiasuniversity.edu.in
Dr. Amit Kumar Goel, Professor, School of Computing Science and Engineering, Galgotias University, Uttar Pradesh, India, amit.goel@galgotiasuniversity.edu.in
Of this enormous count of web sites, only a few legally hold the original, authorized data; the others simply carry the key terms and appear in the results merely to inflate the site count [1][18]. Some of these pages are maintained by hackers to violate the victims' system information and the functions of their systems.
In the internet environment, any user can access legal content on authorized web sites. Several sites tie up with other corporations to advertise their business, so legal sites allow those partners to embed a trademark logo or advertising image along with the partner company's URL. If such a logo or image is clicked, the user is unknowingly and automatically directed to a targeted web site that they never intended to visit. If the directed page is legal, there is no issue for the victim's data; otherwise, the link will find a loophole to compromise the user's sensitive data or the system's security, at which point hackers can carry out their attack. In general, any unauthorized or unwanted activity performed on legal sites through illegal links, in the form of interruptions such as asking users to click buttons, follow unnecessary links, or accept something, or by posting spam-like images and videos, is called an intrusion. The following sections describe the process carried out during web searching and crawling; a short sketch of the underlying link-extraction step appears below.
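As a concrete illustration of the embedded-link problem described above, the sketch below fetches a page, extracts its anchor links, and flags those pointing outside the host domain as candidates for a reliability check before the user follows them. This is a minimal sketch for illustration only, not the crawler implemented in this paper; the page URL is a placeholder, and only the Python standard library is assumed.

    # Minimal sketch: collect embedded links and flag external ones.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collects href targets from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def external_links(page_url):
        """Return embedded links whose domain differs from the page's own."""
        host = urlparse(page_url).netloc
        html = urlopen(page_url, timeout=10).read().decode("utf-8", "replace")
        parser = LinkCollector()
        parser.feed(html)
        absolute = (urljoin(page_url, link) for link in parser.links)
        return [url for url in absolute if urlparse(url).netloc not in ("", host)]

    if __name__ == "__main__":
        # Placeholder URL; any third-party links found here would need a
        # reliability check before being followed.
        for url in external_links("https://example.com/"):
            print("external (verify before following):", url)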
II. RELATED WORKS
Andas Amrin et al. [2] presented their views on the fish search algorithm, which identifies web content by assessing score values. As each URL is visited, the first link is updated and the next linked URLs are processed by ranking their value as relevant (1) or non-relevant (0). A related (child) URL is assigned a predetermined depth value in the list; otherwise the URL is dropped. Their searching implementation works like browsing with optimized strategies and is faster than other algorithms at setting parent and child URLs. In this model, however, downloading web documents from the WWW consumes more time, the crawl creates high traffic because it accesses network resources, and crawling the hidden web is impossible [2]. A minimal sketch of this scoring scheme follows.
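The sketch below illustrates the fish-search scoring idea summarized above, under the assumption of binary relevance (1/0) and a depth budget that child URLs inherit; fetch_page and extract_links are hypothetical helpers standing in for the downloader and parser used in [2], not part of that work's code.

    # Sketch of fish-search frontier management with binary relevance.
    from collections import deque

    def is_relevant(text, keyword):
        # Binary relevance: 1 if the keyword occurs in the page, else 0.
        return 1 if keyword.lower() in text.lower() else 0

    def fish_search(seed_url, keyword, fetch_page, extract_links,
                    depth_budget=3, max_pages=100):
        frontier = deque([(seed_url, depth_budget)])
        seen, results = {seed_url}, []
        while frontier and len(results) < max_pages:
            url, depth = frontier.popleft()
            text = fetch_page(url)          # hypothetical downloader
            score = is_relevant(text, keyword)
            results.append((url, score))
            # Children of a relevant page get a refreshed depth budget;
            # children of a non-relevant page get one step less, and a
            # URL whose budget is exhausted is dropped from the list.
            child_depth = depth_budget if score == 1 else depth - 1
            if child_depth > 0:
                for link in extract_links(text, url):  # hypothetical parser
                    if link not in seen:
                        seen.add(link)
                        frontier.append((link, child_depth))
        return results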
Herseovici M et al. [3] and Lei Luo et al. [4] developed an efficient algorithm named the (adaptive) shark-search algorithm to give remedial actions