International Journal of Innovative Technology and Exploring Engineering (IJITEE)
ISSN: 2278-3075, Volume-9 Issue-3, January 2020
Published By:
Blue Eyes Intelligence Engineering
& Sciences Publication
Retrieval Number: C9023019320/2020©BEIESP
DOI: 10.35940/ijitee.C9023.019320
Abstract: Web pages contain pieces of information of unequal
importance. Items such as navigation bars, copyright notices, links,
and advertisements are considered noise, that is, insignificant
items of a web page for web mining. Only the informative content of
a web page is useful for performing an effective web mining task,
and the presence of noise on a web page can hamper the result of
this task. A web page has several features, including information
location, occupied area, and content. Content in different portions
of a web page has different significance weights according to its
location, occupied area, and content, which are considered to be
features of the web page. The position and importance of contents
play a vital role in identifying noise in web pages for removal. In
this paper, a web page feature based method is proposed for
identifying noise in web pages. The k-means clustering technique is
applied to group the main content and the noise content of web
pages into two clusters based on these features. To evaluate the
performance of the clustering method, accuracy, precision, recall,
and F-measure are calculated.
Keywords: Noise, Feature Extraction, Clustering, HTML Tag,
Tag Weight, Web Pages.
I. INTRODUCTION
The World Wide Web is huge, and web pages contain large amounts
of informative data. Researchers usually want only the useful
content of web pages, and that content needs to be processed. Data
mining on the web has become a main task for detecting useful data.
Web information usually contains a large amount of noise that is
not useful for mining, such as navigation bars, links,
advertisements, and copyright notices. Demarcating important
information from noisy content is essential because noise misleads
the user. The performance of web mining can be improved by
identifying and removing noise from web pages.
This paper proposes a web page feature based method for
identifying and removing noise from web pages, which helps web
mining operations run efficiently.
The method groups data into two clusters, noise data and
non-noise data, using two feature variables of web pages (final tag
content weight and location feature weight) through the k-means
clustering technique. Web page clustering automatically categorizes
data into different groups.
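The clustering step described above can be illustrated with a minimal sketch. It assumes each web page block has already been reduced to a pair of numeric features (a tag content weight and a location weight); the sample values, the function name `kmeans`, and the random initialisation are illustrative assumptions, not the authors' implementation. The cluster ids returned are arbitrary, so deciding which cluster represents noise requires a separate rule that this sketch does not attempt.

```python
import random


def kmeans(points, k=2, max_iter=100, seed=0):
    """Lloyd's k-means on 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids with k input points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = []
        for x, y in points:
            dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids


# Hypothetical (tag content weight, location weight) pairs, one per block.
blocks = [(0.90, 0.80), (0.85, 0.90), (0.80, 0.75),
          (0.10, 0.20), (0.15, 0.10), (0.20, 0.15)]
labels, centroids = kmeans(blocks, k=2)
for block, label in zip(blocks, labels):
    print(block, "-> cluster", label)
```

With well-separated feature values such as these, the two groups of blocks end up in different clusters regardless of which points are picked as initial centroids.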
Revised Manuscript Received on January 5, 2020
* Correspondence Author
S. S. Bhamare, School of Computer Sciences, Kavayitri Bahinabai
Chaudhari North Maharashtra University, Jalgaon, India
Email: ssbhamare.nmu@gmail.com
B. V. Pawar, School of Computer Sciences, Kavayitri Bahinabai
Chaudhari North Maharashtra University, Jalgaon, India
Email: bvpawar@hotmail.com
In performance evaluation, metrics such as Accuracy,
Precision, F-measure, and Recall are used.
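As a brief illustration of these four metrics, the sketch below computes them from ground-truth and predicted labels using their standard confusion-matrix definitions. It assumes that cluster ids have already been mapped to class labels and that one class is treated as the positive class; the function name `evaluate_binary` and the sample labels are hypothetical and do not come from the paper.

```python
def evaluate_binary(true_labels, predicted_labels, positive=1):
    """Return accuracy, precision, recall and F-measure for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1  # correctly predicted positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # correctly predicted negative

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical labels: 1 = noise block, 0 = main content block (an assumption).
truth = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 1]
print(evaluate_binary(truth, predicted))
```

The guards against zero denominators return 0.0 rather than raising an error, which is one common convention when a class is never predicted.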
II. RELATED WORK
Cleaning noisy web pages is an important task: it retrieves and
extracts the main content and eliminates noise data from web
pages. Many researchers have worked in this area.
Yong Zhang et al. proposed a method to find the main content
block of a web page using web page purification based on an
improved Document Object Model (DOM) and statistical learning.
This approach produces a block tree structure, which is helpful
for information retrieval, information extraction, and web page
classification using statistical learning [2].
Li Xiaoli et al. suggested a new technique to remove noise data
in web page classification. It first represents a web page based
on its HTML tags, then uses a new distance formula and removes the
noise data using a similarity measure [3].
Cai Deng et al. proposed the Vision-based Page Segmentation
(VIPS) algorithm. It uses web page layout features and some
empirical rules to partition the web page at the semantic level.
The main restriction of this method is that visual rendering and
partitioning of web pages are resource intensive [4].
Zhao Cheng-li et al. used a new style tree model to identify
and remove web page noise. It determines whether an element node
is noisy through information-based measures. The proposed
technique is able to improve mining results considerably [5].
Bhamare et al. discussed different supervised and
unsupervised web page noise cleaning techniques [6].
III. SYSTEM FLOW DIAGRAM OF PROPOSED METHOD
The process flow diagram (Figure 1) shows the overall web
page noise detection and removal process of the proposed
method.
The proposed web page feature based method consists of the
following four steps:
i) Feature extraction: This step determines and extracts
possibly important features from web pages.
ii) Feature selection: This step finds the best set of
features required for the proposed method.
Feature Based Identification of Web Page Noise
through K-Means Clustering
S. S. Bhamare, B. V. Pawar