Abstract—With the World Wide Web (WWW) being used to its full potential, automatic classification of web pages into web directories has become more significant. These web directories help search engines provide users with relevant and quick retrieval results. In this paper, a novel approach to web page classification is implemented by combining the k nearest neighbor (kNN) classifier and an association rule mining algorithm. The web pages are preprocessed and discretized before inducing the classifier. The proposed method uses a) a feature weighting scheme based on association rules and b) a distance weighted voting scheme. The distance weighted voting scheme enables the model to work for any value of k, whether odd or even. Experiments on a benchmark data set, WebKB, show that the classification accuracy of the proposed method is significantly better than that of many existing web page classification methods.

Keywords—web page classification, discretization, kNN classifiers, association rules, WebKB.

I. INTRODUCTION

Web page classification (WPC) is the technique of segregating web pages into specific categories so as to group together web pages of similar significance. Classifying web pages makes retrieval of precise information possible, which not only enables faster searching of data but also maintains the quality of the information retrieved. WPC has strong connections with natural language processing, data mining, text mining, machine learning, information retrieval, and knowledge management. Many machine learning algorithms have been adapted for web page classification. The k nearest neighbor (kNN) classifier and association rule mining are two popular data mining algorithms. The kNN classification algorithm is simple and easy to implement.
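As an illustration of how little code a basic kNN classifier needs, the following is a minimal Python sketch, not the implementation used in this paper. It contrasts simple majority voting with a distance weighted vote. The inverse-distance weight 1/(d + eps), the function names, and the toy two-dimensional points are all assumptions made for illustration; the weighting scheme proposed in this paper is described in Section 3 and may differ.

```python
import math
from collections import Counter, defaultdict


def k_nearest(train, query, k):
    """Return the k (distance, label) pairs closest to query (Euclidean distance)."""
    scored = [(math.dist(x, query), label) for x, label in train]
    return sorted(scored, key=lambda pair: pair[0])[:k]


def majority_vote(neighbors):
    """Plain majority vote; with an even k a tie is broken arbitrarily."""
    counts = Counter(label for _, label in neighbors)
    return counts.most_common(1)[0][0]


def weighted_vote(neighbors, eps=1e-9):
    """Each neighbor votes with weight 1 / (distance + eps) (assumed weighting)."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + eps)
    return max(scores, key=scores.get)


# Hypothetical toy data: two points per class, so k = 4 gives a 2-2 split.
train = [
    ((0.0, 0.0), "course"),
    ((0.2, 0.1), "course"),
    ((1.0, 1.0), "faculty"),
    ((1.2, 0.9), "faculty"),
]
query = (0.3, 0.3)

neighbors = k_nearest(train, query, k=4)
print("majority vote:", majority_vote(neighbors))  # tie, result is arbitrary
print("weighted vote:", weighted_vote(neighbors))  # closer neighbors dominate
```

With k = 4 the label counts are tied, so the majority vote carries no real information, while the weighted vote favors the class of the closer neighbors.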
However, it suffers from several issues: 1) there is no systematic approach for choosing the best value of k, and 2) the simple majority voting scheme degrades the classification accuracy whenever there is an equal class distribution. The association rule mining task generates rules that help to identify associations or correlations between items in a transaction database. These rules are generated using two interestingness measures, min_support and min_confidence, as thresholds. In this paper, the performance of the kNN classifier is improved by using a new feature weighting scheme and a new distance weighted voting scheme. The feature weighting scheme uses the rules generated for the web page data set using min_support and min_confidence as thresholds.

The rest of the paper is organized as follows: Section 2 highlights the related work, the proposed work is described in Section 3, details of the experiments are summarized in Section 4, and Section 5 highlights the results and findings.

J. Alamelu Mangai is with Birla Institute of Technology & Science Pilani, Dubai Campus, Dubai, 345055, UAE (corresponding author's phone: +971503987928; fax: +97144200844; e-mail: mangai@bits-dubai.ac.ae).
Satej Milind Wagle is with Birla Institute of Technology & Science Pilani, Dubai Campus, Dubai, 345055, UAE (e-mail: satejwagle@gmail.com).
V. Santhosh Kumar is with Birla Institute of Technology & Science Pilani, Dubai Campus, Dubai, 345055, UAE (e-mail: santhoshkumar@bits-dubai.ac.ae).

II. RELATED WORK

Many approaches to automatic WPC have appeared in the literature over the years. The structure of web documents and the images present in them are used to classify them into various categories in [1]. The performance of the web page classifier is improved using feature selection subsets. A minimal set of high-quality features is found by integrating the CfsSubset evaluator with the term frequency method [2] and Ward's minimum variance [3].
The association between the blocks in a web page [4] is used to frame a query within a content based classification framework to classify a web page. Visual features of a web page, such as color and edge histograms and Gabor and texture features, are used in [5], and summaries generated by human experts are used in [6]. These approaches to web page classification cannot be applied in situations that suffer from hardware and software limitations. Further, they require considerable human expertise and are computationally complex. The various technologies in web information extraction are explored in [7], whose authors express concern that many researchers start directly with complex approaches rather than trying simpler ones first. It is shown in [8] that Naïve Bayes (NB) and C4.5 decision tree models are fast, consistent, easy to maintain, and accurate in the training courses domain. An NB classifier based on Independent Component Analysis [9] and Hidden Naïve Bayes [10] with Symmetrical Uncertainty for word selection give more satisfactory results in web page categorization. Motivated by these facts, this paper focuses on content based web page classification using a simple machine learning method, namely k-nearest neighbor (kNN) classification. As the class distribution in the training set is uneven, a method to choose a different value of k for each category is proposed in [11].

A Novel Web Page Classification Model using an Improved k Nearest Neighbor Algorithm
J. Alamelu Mangai, Satej Milind Wagle, and V. Santhosh Kumar
3rd International Conference on Intelligent Computational Systems (ICICS'2013), April 29-30, 2013, Singapore

The traditional support vector machine (SVM) is combined with