Baseline Keyphrase Extraction Methods from Hebrew News HTML Documents

Yaakov HaCohen-Kerner, Ittay Stern, David Korkus
Department of Computer Science, Jerusalem College of Technology (Machon Lev)
21 Havaad Haleumi St., P.O.B. 16031, 91160 Jerusalem, Israel

Abstract: Most documents do not include keyphrases. There are a few keyphrase extraction systems for documents written in English. However, there is no such system for the Hebrew language. In this ongoing work, we investigate baseline methods that extract keyphrases from Hebrew news HTML documents. These methods have been tested on a set of documents. Each document has an accompanying file containing keyphrases extracted by students who read the original documents. The two best baseline methods were found to be Term Frequency (TF) and First N terms (FN). These results are similar to those reported for documents written in English.

Key-Words: Extraction, HTML Documents, Keyphrases, Keywords, Text Summarization, Hebrew

1 Introduction

The explosion of information is hard to handle, and reading everything can be very time consuming. Various kinds of summaries (e.g., headlines, abstracts and conclusions) enable people to decide whether to read the whole text. Keyphrases, which can be regarded as very short summaries, may help even more. For instance, keyphrases can serve as an initial filter when retrieving documents. Unfortunately, most documents do not include keyphrases.

There are a few automatic keyphrase extraction systems for documents written in English. However, there is no such system for the Hebrew language. In this ongoing work, we investigate baseline methods that extract keyphrases from Hebrew news HTML documents.

This paper is organized as follows. Section 2 gives background concerning extraction of keyphrases, the Hebrew language, and baseline extraction methods. Section 3 describes the proposed model. Section 4 presents the experiments we have carried out.
Section 5 concludes and proposes future directions for research.

2 Background

2.1 Extraction of Keyphrases

A keyphrase is an important concept, expressed either as a single word (unigram), e.g., ‘learning’, or as a collocation, i.e., a meaningful group of two or more words, e.g., ‘machine learning’ and ‘natural language processing’. Keyphrases provide general information about the contents of the document and can be seen as an additional kind of document abstraction.

The basic idea of keyphrase extraction for a given article is to build a list of words and collocations sorted in descending order of frequency, while filtering out general terms and normalizing similar terms (e.g., “similar” and “similarity”). The filtering is done by using a stop-list of closed-class words such as articles, prepositions and pronouns. The most frequent terms are selected as keyphrases, since we assume that authors repeat important words as they advance and elaborate.

One system that applies this method, among other basic methods, was developed by HaCohen-Kerner [8]. In this system, keyphrases for academic papers written in English are extracted from their abstracts and titles. Three other keyphrase extraction systems that deal with whole English documents are discussed below.

Turney [17] developed a keyphrase extraction system. This system uses a few baseline extraction methods, e.g., TF (term frequency), FA (first appearance of a phrase from the beginning of its document, normalized by dividing by the number of words in the document) and TL (length of a phrase in number