ABHATH AL-YARMOUK: "Basic Sci. & Eng." Vol. 22, No. 1, 2013, pp. 75- 95
© 2012 by Yarmouk University, Irbid, Jordan.
* Faculty of Sciences & IT, Zarqa University, Zarqa - Jordan
** Department of Computer Information Systems, Yarmouk University, Irbid, Jordan.
*** College of Business and Information System, Dakota State University, Madison, SD, USA
Keyword Extraction Based on Word Co-Occurrence
Statistical Information for Arabic Text
Mohammed Al-Kabi*, Hassan Al-Belaili**, Bilal Abul-Huda** and Abdullah H. Wahbeh***
Received on Jan. 22, 2012 Accepted for publication on June 24, 2012
Abstract
Keyword extraction has many useful applications, including indexing, summarization, and
categorization. In this work we present a keyword extraction system for Arabic documents that uses
term co-occurrence statistical information, an approach already used in other systems for the English
and Chinese languages. The technique is based on extracting the most frequent terms and building a
co-occurrence matrix that records how often each term co-occurs with each frequent term. If the
co-occurrence distribution of a term is biased toward particular frequent terms, the term is
considered important and is likely to be a keyword. The degree of bias between a term and the set of
frequent terms is measured using χ². Terms with high χ² values are therefore likely to be keywords.
The χ² method adopted in this study is compared with another novel method based on term frequency -
inverted term frequency (TF-ITF), which is tested here for the first time. Two datasets were used to
evaluate the system performance. Results show that the χ² method is better than TF-ITF: in the first
experiment the precision and recall of χ² were 0.58 and 0.63 respectively, and in the second
experiment the accuracy of χ² was 64%. These results show that the χ² method can be applied to
Arabic documents and that its performance is acceptable compared with other techniques.
Keywords: Keyword extraction, Arabic Keyword extraction, Information Retrieval,
Natural language processing.
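The co-occurrence procedure summarized in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: it assumes the text is already split into tokenized sentences, it uses one common formulation of the χ² score (observed versus expected co-occurrence with each frequent term), and the names chi_square_keywords, num_frequent and the sample tokens are hypothetical.

```python
from collections import Counter


def chi_square_keywords(sentences, num_frequent=3):
    """Rank terms by a chi-square co-occurrence bias score.

    sentences: list of token lists (assumed already tokenized and normalized).
    num_frequent: how many top-frequency terms form the reference set (assumption).
    """
    term_freq = Counter(tok for sent in sentences for tok in sent)
    frequent = [term for term, _ in term_freq.most_common(num_frequent)]
    frequent_set = set(frequent)

    cooc = {}          # cooc[w][g]: number of sentences containing both w and g
    n_w = Counter()    # n_w[w]: total number of tokens in sentences containing w
    for sent in sentences:
        unique = set(sent)
        for w in unique:
            n_w[w] += len(sent)
            row = cooc.setdefault(w, Counter())
            for g in unique:
                if g != w and g in frequent_set:
                    row[g] += 1

    total_tokens = sum(len(sent) for sent in sentences)
    # p[g]: expected probability of co-occurring with frequent term g
    p = {g: n_w[g] / total_tokens for g in frequent}

    scores = {}
    for w in term_freq:
        score = 0.0
        for g in frequent:
            if g == w:
                continue
            expected = n_w[w] * p[g]
            observed = cooc[w][g]
            score += (observed - expected) ** 2 / expected
        scores[w] = score

    # Highest score first: the most biased co-occurrence distribution
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy input; real input would be Arabic sentences after preprocessing.
sample = [
    ["keyword", "extraction", "arabic", "text"],
    ["arabic", "text", "retrieval"],
    ["keyword", "extraction", "statistics"],
    ["arabic", "documents", "retrieval", "keyword"],
]
ranked = chi_square_keywords(sample, num_frequent=3)
```

The top entries of ranked are the candidate keywords. On very short inputs such as this toy sample the scores are dominated by small expected counts, so the sketch is meaningful only on documents of realistic length.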
Introduction
Today, the Internet contains a huge amount of electronic information such as
papers, articles and news. This volume makes effective ways of retrieving and
filtering the desired information a necessity. Many search engines are able
to retrieve the most relevant documents, but there is also a need to show a brief description of
the retrieved information, especially when people are unable to summarize such a large
amount of information themselves. Keyword extraction techniques are important for information
seekers, since they allow them to find what they want just by looking at the suggested
keywords and so to decide what they should read. The need for keyword extraction comes
from the rapid growth of the Internet; the amount of information is increasing quickly in
many different languages [1], and documents in the Arabic language are part of this growth.
Therefore there is an increasing need for the retrieval, filtering and mining of Arabic
documents on the World Wide Web.