International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-9 Issue-3, January 2020
Published By: Blue Eyes Intelligence Engineering & Sciences Publication
Retrieval Number: C9023019320/2020©BEIESP
DOI: 10.35940/ijitee.C9023.019320

Abstract: Web pages contain pieces of information of unequal importance. Items such as navigation bars, copyright notices, links, and advertisements are considered noise, that is, insignificant items of a web page for web mining. Only the informative content of a web page is useful for effective web mining, and the presence of noise on a web page can hamper the results of this task. A web page has several features, including the location of its information, the area that information occupies, and its content. Content in different portions of a web page carries different significance weights according to its location, occupied area, and content, which are considered to be features of the web page. The position and importance of content play a vital role in identifying noise in web pages for removal. In this paper, a web page feature based method is proposed for identifying noise in web pages. The k-means clustering technique is applied to separate the main content and the noise content of web pages into two clusters based on these features. To evaluate the performance of the clustering method, accuracy, precision, recall, and F-measure are calculated.

Keywords: Noise, Feature Extraction, Clustering, HTML Tag, Tag Weight, Web Pages.

I. INTRODUCTION

In the vast World Wide Web, web pages contain large amounts of informative data. Researchers generally want only the useful content of web pages, and it is this content that needs to be processed. Data mining on the web has therefore become a main task for detecting useful data.
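As a minimal sketch of the approach summarized in the abstract, the Python fragment below groups page blocks into two clusters with k-means on two features and then computes accuracy, precision, recall, and F-measure. The feature values, the ground-truth labels, the centroid initialization, and the rule that the cluster with the larger centroid is treated as main content are all illustrative assumptions and are not taken from the paper.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def two_means(points, max_iter=100):
    """Plain Lloyd's k-means with k = 2 on 2-D feature vectors.

    Assumption: the lexicographically smallest and largest points seed the
    two centroids, which keeps this toy example deterministic.
    """
    centroids = [min(points), max(points)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            0 if dist(p, centroids[0]) <= dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:  # no change, so the clustering has converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


def evaluate(truth, predicted, positive=1):
    """Accuracy, precision, recall and F-measure for binary labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical blocks: (final tag content weight, location feature weight).
blocks = [(0.90, 0.80), (0.85, 0.75), (0.80, 0.90),
          (0.10, 0.20), (0.15, 0.10), (0.20, 0.15),
          (0.70, 0.60)]              # last block: a prominent advertisement
truth = [1, 1, 1, 0, 0, 0, 0]        # made-up labels: 1 = main content, 0 = noise

labels, centroids = two_means(blocks)
# Assumption: the cluster whose centroid has the larger weights is main content.
main = 0 if sum(centroids[0]) > sum(centroids[1]) else 1
predicted = [1 if lab == main else 0 for lab in labels]
metrics = evaluate(truth, predicted)
print(predicted, metrics)
```

In this toy data the advertisement block has high weights on both features, so it falls into the main content cluster and lowers precision while recall is unaffected. This is the kind of error the four metrics are meant to expose.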
Web information usually also contains large amounts of noise that is not useful for mining, such as navigation bars, links, advertisements, and copyright notices. Demarcating important information from noisy content is essential because noise misdirects user interest. The performance of web mining can be improved by identifying and removing noise from web pages. This paper proposes a web page feature based method for identifying and removing noise from web pages, which supports efficient web mining operations. The method groups data into two clusters, noise data and non-noise data, using two feature variables of web pages (final tag content weight and location feature weight) through the k-means clustering technique. Web page clustering automatically categorizes data into different groups. In performance evaluation, metrics such as accuracy, precision, F-measure, and recall are used.

Revised Manuscript Received on January 5, 2020.
* Correspondence Author
S. S. Bhamare, School of Computer Sciences, Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon, India. Email: ssbhamare.nmu@gmail.com
B. V. Pawar, School of Computer Sciences, Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon, India. Email: bvpawar@hotmail.com

II. RELATED WORK

Cleaning noisy web page data, by extracting the main content and eliminating noise from web pages, is an important task, and many researchers have worked in this area. Yong Zhang et al. proposed a method to find the main content block of a web page using web page purification based on an improved Document Object Model (DOM) and statistical learning. This approach produces a block tree structure that is helpful for information retrieval, information extraction, and web page classification using statistical learning [2]. Li Xiaoli et al. suggested a new technique to remove noise data in web page classification.
It first represents the presentation of a web page based on its HTML tags, then applies a new distance formula and removes the noise data using a similarity measure [3]. Cai Deng et al. proposed the Vision-based Page Segmentation (VIPS) algorithm. It uses the full layout features of a web page and some experimental rules to partition the page at the semantic level. The main limitation of this method is that visual rendering and partitioning of web pages are resource intensive [4]. Zhao Cheng-li et al. used a new style tree model to identify and remove web page noise. It determines whether an element node is noisy through information-based measures. The proposed technique is able to improve the mining outcome considerably [5]. Bhamare et al. discussed different supervised and unsupervised web page noise cleaning techniques [6].

III. SYSTEM FLOW DIAGRAM OF PROPOSED METHOD

The process flow diagram (Figure 1) shows the overall web page noise detection and removal process of the proposed method. The proposed web page feature based method consists of the following four steps:
i) Feature extraction: This step determines and extracts the possibly important features from web pages.
ii) Feature selection: In this step we find the best required feature set for the proposed method.

Feature Based Identification of Web Page Noise through K-Means Clustering
S. S. Bhamare, B. V. Pawar