IJSRST173743 | Received : 10 Sep 2017 | Accepted : 19 Sep 2017 | September-October-2017 [(3) 7: 172-181]
© 2017 IJSRST | Volume 3 | Issue 7 | Print ISSN: 2395-6011 | Online ISSN: 2395-602X
Themed Section: Science and Technology
Web Page Noise Removal - A Survey
Dr. S. Vijayarani¹, K. Geethanjali²
¹Assistant Professor, Department of Computer Science, Bharathiar University, Coimbatore, Tamilnadu, India
²M.Phil Research Scholar, Department of Computer Science, Bharathiar University, Coimbatore, Tamilnadu, India
ABSTRACT
Web mining is used to extract useful information from websites, including web documents and the hyperlinks of
websites. The World Wide Web contains a wide range of web pages which are very useful to many users. Web
pages are composed of different kinds of data, such as text, audio, video and images. In addition, web pages
now contain a large amount of unnecessary data, e.g., advertisement posters, navigation bars and
disclaimer/copyright notices. This unnecessary data is called noisy data. It distracts the user and
increases the time needed to perform searching and browsing tasks. For in-depth
analysis of web data, or web content mining, the first and essential step is to remove the noise present in
the web pages; useful information can then be extracted from them. Removing noise from a web page
is a challenging task in web content mining. The main objective of this paper is to discuss the basics of web content
mining, the types of noise, the techniques used for noise removal and the different models used in the literature.
Keywords : Web Content, Web page, Global Noise, Local Noise, Filtering.
I. INTRODUCTION
Web mining is used to extract knowledge from web data.
It is classified into three main categories:
web content mining, web structure mining and web
usage mining. Web content mining is used to mine
data from the content of web pages, which consist
of text, graphics, tables, data blocks and data records [1].
It applies the ideas and principles of
data mining and the knowledge discovery process. Web
usage mining, also known as web log mining, is
used to analyze the behavior of website users. It can
predict user behavior while the user interacts
with the web. Web structure mining is based on link
structures. It can be used to categorize web pages and
to generate information such as the similarity and
relationships between different websites.
Extracting useful information from web pages
has become an essential task. The web page is a medium for
accessing information from different sources.
Extracting information from these sources raises
several problems, such as finding the useful information,
extracting knowledge from large data sets and
learning about individual users. Various methods and
techniques have been developed to resolve these problems.
The information technology field holds a massive amount
of data that needs to be transformed into useful
information, which can then be used in
several applications. Different kinds of algorithms and
techniques are available for extracting useful information
from different types of data.
Web content mining covers various kinds of data, such
as image, audio, video and text. In web mining, the content of web
documents can be divided into three kinds: core
information, redundant information and hidden
information [13]. Hidden information comprises
HTML tags, script language and
programming comments. Repeated data in web documents is
called redundant information. The main content
of the web page, such as a news article, is known
as the core information.
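
The three kinds of information can be illustrated with a minimal Python sketch. It is not a method from the surveyed literature: the class name `InfoSplitter`, the function `classify` and the sample page are hypothetical, and the rule that a text chunk repeated verbatim within one page counts as redundant is an assumption made only for illustration. The sketch uses the standard-library `html.parser` module to set aside tags, script bodies and comments as hidden information, and then splits the remaining visible text into core and redundant parts.

```python
from collections import Counter
from html.parser import HTMLParser


class InfoSplitter(HTMLParser):
    """Separates visible text from hidden information (tags, scripts, comments)."""

    def __init__(self):
        super().__init__()
        self.visible = []     # text chunks a reader would see
        self.hidden = []      # tag names, script/style bodies, comments
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.hidden.append(f"<{tag}>")
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_comment(self, data):
        self.hidden.append(f"<!--{data}-->")

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Script and style bodies are never shown to the reader.
        if self._skip_depth:
            self.hidden.append(text)
        else:
            self.visible.append(text)


def classify(html):
    """Returns core, redundant and hidden information for one page.

    Assumption: a visible text chunk that occurs more than once on the
    page is treated as redundant; everything else visible is core.
    """
    parser = InfoSplitter()
    parser.feed(html)
    parser.close()
    counts = Counter(parser.visible)
    core = [text for text in counts if counts[text] == 1]
    redundant = [text for text in counts if counts[text] > 1]
    return {"core": core, "redundant": redundant, "hidden": parser.hidden}


# Hypothetical sample page: a repeated navigation line around one headline.
SAMPLE = """
<html><body>
<!-- build 42 -->
<script>var x = 1;</script>
<p>Home | About</p>
<h1>Storm hits coast</h1>
<p>Home | About</p>
</body></html>
"""

result = classify(SAMPLE)
print(result["core"])
print(result["redundant"])
print(result["hidden"])
```

In this sample the headline is reported as core information and the repeated navigation line as redundant information. Practical noise removal usually compares blocks across several pages of a site instead of within one page, so the repetition rule above should be read only as a simplification.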
In a web mining system, the input data moves through
three stages to reach its final result:
preprocessing, data mining and post-processing [2].
Preprocessing may include removing irrelevant
attributes and cleaning noisy information from the data.
Data mining is a generic term that includes the
techniques and tools used to extract useful information