Unsupervised automatic keywords and keyphrases extractor for web documents

Mohamed Abdou
Information Systems Department, Faculty of Computers and Information, Helwan University, Egypt
mabdou@fci.helwan.edu.eg

Marwa Salah
Information Systems Department, Faculty of Computers and Information, Helwan University, Egypt
marwa.salah@fci.helwan.edu.eg

Sayed AbdelGaber
Information Systems Department, Faculty of Computers and Information, Helwan University, Egypt
sgaber@fci.helwan.edu.eg

Abstract: Keyword extraction is a vital process that aims to find the most valuable terms and phrases that best describe the content of web documents. The tremendous growth of web content has made it necessary to automate the extraction of keywords and keyphrases. This paper presents an online keyword extractor that employs a combination of statistical metrics to extract keywords and keyphrases automatically. The proposed extractor parses the web document, extracts the unique keywords and keyphrases that best describe its content, and finally ranks them according to their importance weights. This paves the way for adding appropriate semantic markup that accelerates the indexing of the website and improves its visibility in search engines.

Keywords: keyword extraction; keyphrase extraction; semantic annotation.

I. INTRODUCTION

Extracting the appropriate keywords and keyphrases from web documents is an important task that plays a vital role in analyzing websites and enriching their content with semantic annotations. Keywords are considered the core concepts that provide a concise summary of a web document [1]. Keyphrases are sets of two or more words that reflect the main topics of a web document. Traditionally, keywords have been widely used for different purposes in text processing applications. Keywords also make searching for documents faster and more accurate.
Moreover, keywords were the basis for indexing web documents in the early days of Search Engine Optimization (SEO) [2,3]. Searching by keyword is also a powerful method that enables a huge number of documents to be scanned efficiently [4].

Manually assigning keywords and keyphrases to web documents is a time-consuming and tedious task. Moreover, the number of web documents is increasing rapidly. As a result, automatic extraction has received notable attention in recent years, and many researchers have investigated new approaches to this demanding problem. Approaches that can automatically extract accurate and relevant keywords from web documents can have a significant impact on optimizing websites for search engines [5].

Automatic keyword extraction systems can be classified into two main groups: supervised and unsupervised. Supervised systems depend on a training data set to extract a set of features that represent each document in the collection. This representation is then used to learn a model and to make predictions for new instances from the text collection. The need for a training data set labeled by humans is seen as the main drawback of such approaches. Unsupervised approaches address the problem of finding hidden structure in unlabeled data [6]. Most of the well-known techniques for keyword extraction encounter accuracy and scalability problems. To overcome such limitations, new approaches and solutions are constantly being proposed [5].

This paper presents an unsupervised automatic extractor that employs a combination of statistical metrics, including term frequency, heading weight, and term position, to extract the most relevant keywords and keyphrases that best describe the web document. This allows appropriate semantic markup to be added, which accelerates the indexing of the website and improves its visibility in search engines.
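
To make the idea of combining the three statistical metrics concrete, the following is a minimal sketch and not the extractor proposed in this paper. It assumes a simple weighted sum of normalized term frequency, a binary heading indicator, and a first-occurrence position score. The function name `rank_keywords`, the weights `w_tf`, `w_heading`, and `w_position`, the heading tag set, and the small stopword list are all hypothetical choices made for illustration. The sketch handles single-word keywords only and does not cover keyphrases.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: these tags count as "headings"; the paper does not list them here.
HEADING_TAGS = {"title", "h1", "h2", "h3"}
# Assumption: a tiny illustrative stopword list, not a complete one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


class _TokenCollector(HTMLParser):
    """Collects (word, in_heading) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.heading_depth = 0
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.heading_depth += 1

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self.heading_depth > 0:
            self.heading_depth -= 1

    def handle_data(self, data):
        # Lowercase the text and keep alphabetic words that are not stopwords.
        for word in re.findall(r"[a-z]+", data.lower()):
            if word not in STOPWORDS:
                self.tokens.append((word, self.heading_depth > 0))


def rank_keywords(html, w_tf=1.0, w_heading=2.0, w_position=1.0, top_n=5):
    """Rank words by a weighted sum of three metrics (hypothetical weights)."""
    collector = _TokenCollector()
    collector.feed(html)
    collector.close()
    tokens = collector.tokens
    if not tokens:
        return []

    total = len(tokens)
    tf = Counter(word for word, _ in tokens)
    max_tf = max(tf.values())
    heading_words = {word for word, in_heading in tokens if in_heading}

    # Index of the first occurrence of each word; earlier words score higher.
    first_pos = {}
    for index, (word, _) in enumerate(tokens):
        first_pos.setdefault(word, index)

    scores = {}
    for word, count in tf.items():
        scores[word] = (
            w_tf * count / max_tf
            + w_heading * (1.0 if word in heading_words else 0.0)
            + w_position * (1.0 - first_pos[word] / total)
        )
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


if __name__ == "__main__":
    page = (
        "<html><head><title>Keyword extraction</title></head><body>"
        "<h1>Keyword ranking</h1>"
        "<p>The extraction of a keyword is useful. Ranking helps search.</p>"
        "</body></html>"
    )
    for word, score in rank_keywords(page, top_n=3):
        print(f"{word}: {score:.3f}")
```

In this sketch a word that appears in a heading, occurs often, and appears early in the page ranks highest. The weights only illustrate how the metrics could be traded off and would need tuning on real pages.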

The rest of this paper is organized as follows: Section 2 provides a brief background on the process of keyword extraction and highlights related work. Section 3 presents the proposed work and the algorithm used to build the extractor. Section 4 presents and discusses the experimental results. Finally, Section 5 concludes the paper and outlines future work.

II. BACKGROUND AND RELATED WORK

Keyword extraction is one of the two main categories of automatic keyword indexing; the other is keyword assignment. Both address the same problem, which is finding the most appropriate keywords for a document. Keyword extraction approaches can be categorized mainly into supervised approaches, which require an annotated data source, and unsupervised approaches, which do not require annotations in advance [5]. The different techniques that represent both supervised and unsupervised approaches are shown in Figure 1.

International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 10, October 2017 225 https://sites.google.com/site/ijcsis/ ISSN 1947-5500