An Interactive System for Extracting Arabic Lexicon from Arabic Newspaper Text
Mohamed Ben Halima¹ and Adel M. Alimi¹
¹ The high school of National Engineering of Sfax
B.P W.3038 Sfax-Tunisia
mohamed.benhlima@ieee.org and adel.alim@ieee.org
Abstract
We describe how to build a large, comprehensive,
integrated Arabic lexicon by automatic parsing of
newspaper text. We have built a parser system that reads
Arabic newspaper articles, isolates their tokens, and
finds the part of speech and the features of each
token. To achieve this goal we designed a set of
algorithms, generated several sets of rules, and
developed a set of techniques together with the
components that carry them out. As each sentence is
processed, new words and features are added to the
lexicon, so that it grows continuously as the system
runs. To test the system we used 75 articles
(7,108 words) from the ASSAHAFA newspaper. The
system consists of several modules: the tokenizer
module, which isolates the tokens; the type finder
system, which finds the part of speech of each token;
the proper noun phrase parser module, which marks the
proper nouns and discovers some information about them;
and the feature finder module, which finds the features
of the words.
1. Introduction
All natural language processing systems need a
lexicon full of explicit information. A lexicon is
considered to be the backbone of any natural language
application. It is an essential basis for parsing, text
generation, and question answering systems. The
lexicon must contain a variety of information including
all relevant feature values and relations. The
fundamental problem of lexical acquisition is how to
provide natural language systems with the full,
adequate lexical knowledge they need to operate with
the proper degree of efficiency. The answer on which
the community is converging today is to extract the
lexicon from the text itself. Much research on English
and other languages currently concentrates on
designing and building the lexicon automatically. One
approach is to construct methods for automatic tagging
of words in the document. The next step is to build
rules for figuring out the features of those words
automatically. Studying the lexicon of Arabic is
particularly important because research on the
Arabic language is unfortunately still not fully
up to date [1].
When we are dealing with newspaper articles that
have a huge amount of text containing millions of
words, we have a choice between two alternatives:
(1) Build a huge database of words manually and
insert their features manually, which would take
years and a large group of well-trained people with
a solid education in how to analyze words and
figure out their types and their features.
(2) Build rules and algorithms that construct a
lexicon automatically: they analyze the document,
tag the words in it, and figure out their types
and their features.
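As a rough illustration of the second alternative, the following Python sketch chains a tokenizer, a type finder, and a feature finder (the module names come from the abstract) and lets the lexicon grow as sentences are processed. Everything inside the functions is an assumption made for illustration only: the regular expression, the single definite-article rule, the tag names, and the dictionary layout are hypothetical and are not the rules or data structures of the system described here. The proper noun phrase parser is left out.

```python
import re

# Assumption: a token is a run of characters from the basic Arabic letter range.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")

# The Arabic definite article "al-" (alif + lam), written as a prefix.
DEFINITE_ARTICLE = "\u0627\u0644"


def tokenize(sentence):
    """Tokenizer: isolate the Arabic tokens of a sentence."""
    return ARABIC_WORD.findall(sentence)


def find_type(token):
    """Type finder with one hypothetical rule: an 'al-' prefix marks a noun."""
    if token.startswith(DEFINITE_ARTICLE) and len(token) > 2:
        return "noun"
    return "unknown"


def find_features(token, pos):
    """Feature finder with one hypothetical feature: definiteness."""
    features = {}
    if pos == "noun" and token.startswith(DEFINITE_ARTICLE):
        features["definite"] = True
    return features


def process_sentence(sentence, lexicon):
    """Add every token of the sentence to the lexicon, or update its entry."""
    for token in tokenize(sentence):
        pos = find_type(token)
        entry = lexicon.setdefault(token, {"pos": pos, "features": {}, "count": 0})
        entry["features"].update(find_features(token, pos))
        entry["count"] += 1
    return lexicon


# Usage: "qara'a al-walad al-kitab" (the boy read the book).
lexicon = {}
process_sentence(
    "\u0642\u0631\u0623 \u0627\u0644\u0648\u0644\u062F \u0627\u0644\u0643\u062A\u0627\u0628",
    lexicon,
)
print(len(lexicon))  # 3 entries after the first sentence
```

Calling `process_sentence` again with further sentences adds only the unseen tokens and updates the counts of known ones, which mirrors the idea of a lexicon that grows continuously as the system runs.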
If we choose the second alternative when we are
dealing with the Arabic language, we are faced with
two main challenges that are not present in English.
(1) The documents are full of proper nouns, which
need special rules to be tagged in the text, because
the Arabic language does not distinguish between
lower and upper case letters. Upper case letters
provide major help in marking proper nouns in
English; they allow us to look for the capitalized
letters in the text and start working around them.
In Arabic there is no clear rule like this to guide
us, which leaves us with a big problem in
recognizing proper nouns in Arabic text.
(2) Vowels are not written in the Arabic text we are
using. (They are written only in books for children
and in the holy Quran.) Different vowels change
the meaning of the word, so an unvowelled written
form can be ambiguous.
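Both challenges can be checked in a few lines of Python. The sketch below is only an illustration: the example words (the city name Sfax, and three vowelled forms built on the letters k-t-b) are our own choices and do not come from the test corpus, and the helper names are hypothetical. It relies on the standard `unicodedata.combining` function, which returns a non-zero class for the Arabic short-vowel signs.

```python
import unicodedata


def has_case_distinction(text):
    """True if upper-casing and lower-casing the text give different strings."""
    return text.upper() != text.lower()


def strip_vowel_marks(text):
    """Drop combining marks, which include the Arabic short-vowel signs."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)


# Challenge 1: a proper noun in Latin script carries a case cue, Arabic does not.
SFAX_LATIN = "Sfax"
SFAX_ARABIC = "\u0635\u0641\u0627\u0642\u0633"
print(has_case_distinction(SFAX_LATIN))   # True
print(has_case_distinction(SFAX_ARABIC))  # False

# Challenge 2: three different vowelled words share one unvowelled spelling.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # kutub, "books"
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"  # kutiba, "it was written"

unvowelled = {strip_vowel_marks(word) for word in (KATABA, KUTUB, KUTIBA)}
print(len(unvowelled))  # 1: all three collapse to the same written form
```

Because newspaper text arrives already unvowelled, a parser sees only the collapsed form and has to recover the intended reading from context.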
978-1-4244-3397-1/08/$25.00 ©2008 IEEE