International Journal of Computer Applications (0975-8887)
Volume 181, No. 8, August 2018

Web Mining Techniques to Block Spam Web Sites

Esraa M. EL-Mohdy
Computer Science Department, Faculty of Specific Education, Mansoura University

A. F. El-Gamal
Computer Science Department, Faculty of Specific Education, Mansoura University

Hanan E. Abdelkader
Computer Science Department, Faculty of Specific Education, Mansoura University

ABSTRACT
The aim of this paper is to introduce a system based on web mining techniques to block spam web pages. The system relies on content analysis; the features used are the Uniform Resource Locator (URL), the number of words in the page title, Globally Popular Keywords (GPK), and N-grams. The proposed system uses Decision Tree (DT) rules, which was the best classifier for detecting web spam content. It produces an accuracy of .97 % in detecting spam web sites.

Keywords
Web Mining, Spam Web Sites, Decision Tree.

1. INTRODUCTION
With the rapid growth of the World Wide Web (WWW), there is a massive number of web pages on every imaginable topic, from human rights to daily news and sports, as found in online news articles, forums, and blogs. These pages may also contain a mix of media such as graphics, video, audio, multimedia, and pictures. The web has a large number of users from different geographic regions. Users turn to search engines such as Yahoo, Bing, and Google for useful information. A search engine may retrieve millions of web pages for each search request, but users look at only a few selected pages [1]. Given the enormous amount of information available on the Internet, users usually locate useful web sites by querying a search engine. In response to a query, the search engine determines the relevant web sites and presents the user with links to them, usually in batches of 10-20 links [2]. Search engine spam refers to undesirable sites that earn substantial revenue by manipulating the content and links of a web site.
People who spam search engines, or who create spammy content for them, are called spammers [3]. Spamming is any deliberate action intended solely to boost a web page's position in search engine results, incommensurate with the true value of the page. Web spam refers to the web pages that result from spamming. It is the intentional manipulation of search engine indexes and is one of the methods of search engine optimization. Implementing spam content detection in the search engine reduces unwanted and excessive results [4]. To find the most useful information among the countless web pages available, users depend primarily on search engines. Search engines usually classify a huge number of web pages and return the pages that appear most related to the user's query, ranked by popularity and relevance. Users usually visit the higher-ranked web pages and ignore the others [5]. The purpose of spammers' manipulation is to make their pages appear more relevant to user requests, thereby misleading search engines into raising the rank of a spam web site so that it is included in the top ten links on the first Search Engine Results Page (SERP) [1]. A spam site may contain malicious software that automatically installs itself on the system when the site is opened. The site can also cause financial harm by capturing information such as bank account numbers, passwords, and other financial details. Web spam can therefore be very serious from the user's point of view, since a spam site can attack the victim's system in different ways [4].

2. RELATED WORK
The literature contains numerous papers on the subject of spam web sites, in which the subject is examined from several points of view. This section reviews a few of these papers that are related to the topic of this paper: detecting spam on the Internet, detecting Arabic and non-Arabic web spam, and studies dedicated to assessing the relation between spam and popularity.

Mohammed A.
Saleh et al. presented an improvement of Arabic spam web page detection using new robust features. They proposed a novel set of features that improves the detection of spammy Arabic web sites. These features are Globally Popular Keywords (GPK), Sentence Level Frequent Words (SLFW), and Character N-Gram Graph (CNGG) features. They referred to the newly proposed set as features B, in contrast to the state-of-the-art features, which they referred to as features A. They combined their B features with the state-of-the-art A features to obtain AB features, and then fed the resulting AB features into different classification algorithms, including Ensemble Boosting with Bagging and Decision Tree ensemble methods, Random Forest classifiers, and Decision Tree J48. They obtained an F-measure of about 99.54% with the Random Forest classifier. They applied their new features to a dataset of about 15,962 Arabic web pages containing spam and non-spam sites. They also compared their results with those of previous studies on Arabic spam web pages and found that their results (F-measure of 99.64%) exceeded all previous results (98%) on the same dataset (Dataset 2010) [1].

Alexandros Ntoulas et al. presented work on detecting spam web pages through content analysis. They continue their investigation of web spam: the injection of artificially generated pages into the web to influence search engine results, in order to drive traffic to certain sites for profit or fun. The study considers some previously undescribed techniques for automatically detecting spam pages and examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. The features include the amount of anchor text, number of words in the page title, average length of words, number of words in the page, fraction of visible content, compressibility, and independent n-gram likelihoods.
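Several of the content features named above can be illustrated with a short sketch. The Python code below is not taken from the cited study; the function name content_features, the helper class, the word tokenization rule, and the exact definitions of the ratios are assumptions made for illustration only. It computes the number of words in the page title, the number of words in the page, the average word length, a rough fraction of visible content, and a compression ratio as a proxy for compressibility.

```python
import re
import zlib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects <title> text and text outside <script>/<style> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.visible_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0:
            self.visible_parts.append(data)


def content_features(html: str) -> dict:
    """Return a few simple content-based features for one page (illustrative only)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser

    title = " ".join(parser.title_parts)
    visible = " ".join(parser.visible_parts)
    words = re.findall(r"\w+", visible)  # assumed tokenization: runs of word characters

    raw = html.encode("utf-8")
    compressed = zlib.compress(raw)

    return {
        "title_word_count": len(re.findall(r"\w+", title)),
        "page_word_count": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        # Assumed definition: characters of non-markup text over total characters.
        "visible_fraction": sum(map(len, parser.visible_parts)) / len(html) if html else 0.0,
        # Assumed definition: uncompressed size over compressed size;
        # highly repetitive pages yield larger values.
        "compression_ratio": len(raw) / len(compressed) if raw else 0.0,
    }
```

A feature dictionary of this kind could be computed per page and passed to any classifier; the thresholds or rules that separate spam from non-spam pages would have to be learned from labeled data.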
Their heuristics correctly identify 2,037 (86.2%) of the 2,364 spam