International Journal of Recent Technology and Engineering (IJRTE)
ISSN: 2277-3878, Volume-9 Issue-1, May 2020
Retrieval Number: A1463059120/2020©BEIESP
DOI:10.35940/ijrte.A1463.059120
IDENTIFICATION OF WEB SITE RELIABILITY
THROUGH DATA SCRAPING AT WEB
CRAWLER'S NAVIGATION
S. Ponmaniraj, Tapas Kumar, Amit Kumar Goel
Abstract: Searching for specific content on the web is like looking for a single character in a bundle of pages. When a user enters a keyword into a search engine, the engine passes it to a web mining process that collects every term related to that key phrase. A few of the returned pages give the legal, authenticated material the user actually wants to access, whereas many other pages carry unwanted and malicious code or virus-laden pages that harm the user's activities and the system's functions. An attack in which a web page targets a system with faulty instructions and malevolent programs through some form of intrusion methodology is called phishing. In this attack, the user is led to unknown or illegal sites through unidentified website links embedded among legal site content. Once the victim's system performance is compromised, the hacker begins the attack. To avoid this kind of abuse, the user needs to understand the reliability of a web page's contents before continuing to browse. This paper presents a web crawler architecture, its design complexities, and an implementation for scraping content from visited web pages to identify their reliability and freshness.
Keywords: Intrusion Detection System, Parser, Scanner,
Search Engine Optimization, Semantic Web, Unstructured
Information Management Architecture, Web crawler, Web
Robot.
I. INTRODUCTION
In this internet era, every consumer can access millions of web pages for a single query passed to a search engine. Search engines use many optimization techniques to bring the exact content to users, yet by analyzing only the keyword phrase they still produce an enormous number of result pages. For example, if a client machine passes the term "Crawler" to the Google search engine, it returns more than 2,57,00,00,000 sites containing the searched content. The same keyword passed to the Yahoo search engine brings back more than 19,400,000 web sites, and when it is searched on the DuckDuckGo search engine, the result count runs to about 34 billion web sites.
Revised Manuscript Received on April 15, 2020
S. Ponmaniraj*, Research Scholar, School of Computing Science and Engineering, Galgotias University, Uttar Pradesh, India, ponmaniraj@gmail.com
Dr. Tapas Kumar, Professor, School of Computing Science and Engineering, Galgotias University, Uttar Pradesh, India, tapas.kumar@galgotiasuniversity.edu.in
Dr. Amit Kumar Goel, Professor, School of Computing Science and Engineering, Galgotias University, Uttar Pradesh, India, amit.goel@galgotiasuniversity.edu.in
Of this enormous count of web sites, only a few legally hold the original, authorized data; the others simply carry the key terms and appear in the results merely to inflate the site count [1][18]. Some of these pages are maintained by hackers to violate the victims' system information and the functions of their systems.
In the internet environment, any user can access legal content on authorized web sites. Several sites tie up with other corporations to advertise their business, so legal sites allow those partners to embed a trademark logo or advertising image along with the partner company's URL. If such a logo or image is clicked, the user is unknowingly and automatically directed to a targeted web site that they never intended to visit. If the directed page is legal, there is no issue for the victim's data; otherwise, the link will find a loophole to compromise the user's sensitive data or the system's security, at which point hackers can carry out their attack. In general, any unauthorized or unwanted activity performed on legal sites through illegal links, in the form of interruptions such as asking users to click buttons, follow unnecessary links, or accept something, or by posting spam-like images and videos, is called an intrusion. The following sections describe the process carried out during web searching and crawling; a short sketch of the underlying link-extraction step appears below.
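As a concrete illustration of the embedded-link problem described above, the sketch below fetches a page, extracts its anchor links, and flags those pointing outside the host domain as candidates for a reliability check before the user follows them. This is a minimal sketch for illustration only, not the crawler implemented in this paper; the page URL is a placeholder, and only the Python standard library is assumed.

    # Minimal sketch: collect embedded links and flag external ones.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collects href targets from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def external_links(page_url):
        """Return embedded links whose domain differs from the page's own."""
        host = urlparse(page_url).netloc
        html = urlopen(page_url, timeout=10).read().decode("utf-8", "replace")
        parser = LinkCollector()
        parser.feed(html)
        absolute = (urljoin(page_url, link) for link in parser.links)
        return [url for url in absolute if urlparse(url).netloc not in ("", host)]

    if __name__ == "__main__":
        # Placeholder URL; any third-party links found here would need a
        # reliability check before being followed.
        for url in external_links("https://example.com/"):
            print("external (verify before following):", url)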
II. RELATED WORKS
Andas Amrin et al. [2] presented their views on the fish search algorithm, which identifies web content by assessing score values. As each URL is visited, the first link is updated and the next linked URLs are processed by ranking their value as relevant (1) or non-relevant (0). A related (child) URL is assigned a predetermined depth value in the list; otherwise the URL is dropped. Their searching implementation works like browsing with optimized strategies and is faster than other algorithms at setting parent and child URLs. In this model, however, downloading web documents from the WWW consumes more time, the crawl creates high traffic because it accesses network resources, and crawling the hidden web is impossible [2]. A minimal sketch of this scoring scheme follows.
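The sketch below illustrates the fish-search scoring idea summarized above, under the assumption of binary relevance (1/0) and a depth budget that child URLs inherit; fetch_page and extract_links are hypothetical helpers standing in for the downloader and parser used in [2], not part of that work's code.

    # Sketch of fish-search frontier management with binary relevance.
    from collections import deque

    def is_relevant(text, keyword):
        # Binary relevance: 1 if the keyword occurs in the page, else 0.
        return 1 if keyword.lower() in text.lower() else 0

    def fish_search(seed_url, keyword, fetch_page, extract_links,
                    depth_budget=3, max_pages=100):
        frontier = deque([(seed_url, depth_budget)])
        seen, results = {seed_url}, []
        while frontier and len(results) < max_pages:
            url, depth = frontier.popleft()
            text = fetch_page(url)          # hypothetical downloader
            score = is_relevant(text, keyword)
            results.append((url, score))
            # Children of a relevant page get a refreshed depth budget;
            # children of a non-relevant page get one step less, and a
            # URL whose budget is exhausted is dropped from the list.
            child_depth = depth_budget if score == 1 else depth - 1
            if child_depth > 0:
                for link in extract_links(text, url):  # hypothetical parser
                    if link not in seen:
                        seen.add(link)
                        frontier.append((link, child_depth))
        return results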
Herseovici M et al. [3] and Lei Luo et al. [4] developed an efficient algorithm named the (adaptive) shark-search algorithm to give remedial actions