(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 5, No. 2, 2014
105 | Page www.ijacsa.thesai.org

OLAWSDS: An Online Arabic Web Spam Detection System

Mohammed N. Al-Kabi
Faculty of Sciences & IT, Zarqa University, Zarqa, Jordan

Heider A. Wahsheh
Computer Science Department, College of Computer Science, King Khalid University, Abha, Saudi Arabia

Izzat M. Alsmadi
Information Systems Department, College of Computer & Information Sciences, Prince Sultan University, Riyadh 11586, P. O. Box 66833, Saudi Arabia

Abstract
For marketing purposes, some website designers and administrators use illegitimate Search Engine Optimization (SEO) techniques to improve the ranking of their Web pages and mislead search engines. Some Arabic Web pages use both content and link features to artificially increase their rank in Search Engine Results Pages (SERPs). This study enhances previous work in this field. It covers the design and implementation of an online Arabic Web spam detection system, based on algorithms and mathematical foundations, that detects Arabic content and link Web spam using a tree of spam detection conditions together with user feedback collected through a custom Web browser. Users take part in the decision about any Web page through their feedback: they judge whether the Arabic Web pages shown in the browser are relevant to their particular queries. The proposed system uses content and link features extracted from Arabic Web pages to label each Web page as spam or non-spam. The system also attempts to learn from user feedback to improve its performance automatically. Statistical analysis is adopted in this study to evaluate the proposed system.
Statistical Package for the Social Sciences (SPSS) software is used to evaluate this new system, treating user feedback as the dependent variable and the Arabic content and link features as independent variables. The statistical analysis in SPSS applies a variety of tests, such as analysis of variance (ANOVA). ANOVA is used to show the relationships between the dependent and independent variables in the dataset, which supports problem solving and well-founded decisions.

Keywords: Arabic Web spam; content-based; link-based; Information Retrieval

I. INTRODUCTION

Arab Internet users suffer from two problems: the low percentage of Arabic content on the Internet, and Arabic Web spam, which leads Web search engines to return irrelevant Web pages. When spamming techniques succeed in deceiving a search engine, Internet users lose trust in that search engine. Spamming also has other negative effects, such as wasting the time and effort of search engine users.

This study proposes an integrated system to reduce Arabic content and link Web spam and to filter these malicious Arabic Web pages out of search engine results. Although this study relies on a set of Arabic content and link Web spam conditions that have been used before, it differs from its predecessors by involving Web search engine users in assessing the relevance of the Arabic Web pages listed in Search Engine Results Pages (SERPs).

The proposed system lets users work through a synchronization technique, in which they browse Arabic Web pages and submit a feedback assessment for each visited Web page, subject to security and confidentiality considerations. The synchronization technique helps the proposed system ensure that assessments are submitted by users rather than by agents and robots.
The evaluation of the proposed system's results is based on Statistical Package for the Social Sciences (SPSS) software, which enables us to conduct a statistical analysis and apply a confidence predictive method. In the SPSS analysis, Arabic Web spam features are treated as independent variables, while the Search Engine Ranking (SER), TrustRank, and link popularity scores are treated as dependent variables. The statistical analysis in SPSS applies a variety of tests, such as analysis of variance (ANOVA). ANOVA has two types, one-way and two-way. This study uses two-way analysis of variance to show the relationships between the dependent and independent variables in the dataset.

The main aim of this research is to develop a system that filters unwanted and spam Web pages out of search engine results, based on the Web pages' features and on the users, who play a main role in determining the relevance of SERPs to their different queries.

The rest of the paper is organized as follows: Section two presents selected related work on Web spam. Section three gives an overview of the developed system. Section four describes the experiments and results. Section five summarizes the paper and its contribution.

This research is funded by the Deanship of Research and Graduate Studies in Zarqa Private University / Jordan.

II. RELATED WORKS

The literature is rich with studies related to Web spam, which examine the topic from different perspectives. This section presents a few of these studies which are closely