International Journal of Computer Applications (0975 – 8887) Volume 61 – No. 1, January 2013

Keyword Extraction using Semantic Analysis

Mohamed H. Haggag
Computer Science Department, Faculty of Computers and Information, Helwan University

ABSTRACT
Keywords are a list of significant words or terms that best present the document context in brief and relate to the textual context. Extraction models are categorized as statistical, linguistic, machine learning, or a combination of these approaches. This paper introduces a model for extracting keywords based on their relatedness weight across the entire set of text terms. The strength of a term relationship is evaluated by semantic similarity: document terms are assigned a weighted metric based on the likeness of their meaning content, and terms that are strongly correlated with each other carry high weight in the individual terms' semantic similarity. Providing this overall term similarity is crucial for defining relevant keywords that best express the text in both frequency and weighted likeness. Keywords are recursively evaluated according to their cohesion with each other and with the document context. The proposed model showed enhanced precision and recall over other extraction approaches.

Keywords
Keyword Extraction, Semantic Similarity, Semantic Relatedness, Semantic Analysis, Word Sense

1. INTRODUCTION
Keywords are useful for readers exploring academic, business, social and other articles. They give an insight into the presented article: a clue about the article's concept, so that readers can decide their interest. For search engines, keywords sharpen query results and shorten response time. Keywords can also be viewed as content classifiers, and keyword processing tools are used to scan large text corpora in a short time. Extracting keywords from text is the process of identifying contextually representative terms.
This includes accessing, discovering and retrieving the words and phrases that highlight linguistic senses and the content's subtopics. Conceptually, cross-lingual keyword extraction could be a core and basic function for different natural language processing applications such as text summarization, document clustering (unsupervised learning of document classes), document classification (supervised learning of document classes), topic detection and tracking, information retrieval and filtering, question answering and other text mining applications including search engines [25, 26 and 27].

Keyword extraction methodologies are classified into two main categories: quantitative and qualitative [1]. Quantitative techniques are based on a set of concordances and statistical relations, in addition to formal linguistic processing. Statistical approaches are the simplest keyword extraction models: word statistics such as word frequency, TF*IDF and co-occurrence are used to identify the keywords in a document. The basic intuition underlying this approach is that important concepts are referenced more frequently within the text than non-important ones. However, simply counting the most frequent words is not sufficient to identify proper keywords, so such approaches need further refinement. Concordances are sets of words within the text along with their context reference. Statistical relations are advanced statistics that can be applied to the data set resulting from the linguistic parsing of words and phrases.

Qualitative techniques are based on semantic relations and semantic analysis. Semantic analysis relies on the semantic description of lexical items. The association of each lexical item with its possible domains, in addition to the morphological and syntactic parsing of words, defines the semantic and conceptual relations of phrases. Extracting keywords this way provides highly structured conceptual relations that relate to the content of the text.
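To make the frequency-based scoring concrete, the TF*IDF statistic mentioned above can be sketched as follows. This is a minimal illustration, not the implementation of any of the cited systems; the tokenizer and the toy corpus are illustrative assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphabetic runs (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    """Score every term of every document by TF*IDF:
    TF  = raw count of the term in the document,
    IDF = log(N / df), where df is the number of documents
          containing the term and N the corpus size.
    Terms frequent in one document but rare in the corpus score highest.
    """
    tokens = [tokenize(d) for d in docs]
    df = Counter()
    for doc in tokens:
        df.update(set(doc))          # document frequency per term
    n = len(tokens)
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in tokens
    ]

docs = [
    "semantic analysis extracts keywords from text",
    "statistical keyword extraction counts word frequency",
    "search engines index text and keywords",
]
scores = tf_idf(docs)
# "semantic" is unique to the first document, so it outranks the
# corpus-wide term "text" there.
```

Note that a term occurring in every document gets an IDF of log(1) = 0, which illustrates the shortcoming noted above: raw frequency alone would rank such corpus-wide words highly even though they are poor keywords.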
Such techniques are more precise and reliable than plain statistical methods, which yield frequent words that may nevertheless be unqualified as keywords. Quantitative methods are widely used for a variety of text processing applications such as text summarization, internal document cohesion, inter-textual cohesion across multiple documents, authorship attribution, style consistency and writing skills development. The main application of qualitative techniques, on the other hand, is content analysis.

The statistical keyword extraction from a single document presented in [2] starts by extracting frequent terms and then extracts a set of co-occurrences between each term and the frequent terms, i.e., occurrences within the same sentences. The co-occurrence distribution shows the importance of a term in the document: if the probability distribution of co-occurrence between a term and the frequent terms is biased toward a particular subset of the frequent terms, then the term is likely to be a keyword. The degree of bias of a distribution is measured by the χ²-measure. Testing on a text corpus where documents are represented by a vector of features based on word co-occurrence was provided by [3]. A global unsupervised text feature selection approach based on frequent itemset mining was applied: each document is represented as a set of words that co-occur frequently in the given corpus, and documents are then grouped to identify the important words using a locally adaptive clustering algorithm designed to estimate word relevance. A different approach was used in [4], where a genetic algorithm for keyword extraction in a supervised environment was developed. In [5], two extraction techniques were compared on a biological domain by extracting keywords from