An Interactive System for Extracting Arabic Lexicon from Arabic Newspaper Text

Mohamed Ben Halima 1 and Adel M. Alimi 1
1 The high school of National Engineering of Sfax, B.P W.3038 Sfax-Tunisia
mohamed.benhlima@ieee.org and adel.alim@ieee.org

Abstract

We describe how to build a large, comprehensive, integrated Arabic lexicon by automatically parsing newspaper text. We have built a parser system that reads Arabic newspaper articles, isolates the tokens in them, and finds the part of speech and the features of each token. To achieve this goal we designed a set of algorithms, generated several sets of rules, and developed a set of techniques together with the components that carry them out. As each sentence is processed, new words and features are added to the lexicon, so that it grows continuously as the system runs. To test the system we used 75 articles (7,108 words) from the ASSAHAFA newspaper. The system consists of several modules: the tokenizer module, which isolates the tokens; the type finder system, which finds the part of speech of each token; the proper noun phrase parser module, which marks the proper nouns and discovers some information about them; and the feature finder module, which finds the features of the words.

1. Introduction

All natural language processing systems need a lexicon full of explicit information. A lexicon is considered to be the backbone of any natural language application. It is an essential basis for parsing, text generation, and question answering systems. The lexicon must contain a variety of information, including all relevant feature values and relations. The fundamental problem of lexical acquisition is how to provide natural language systems with the full, adequate lexical knowledge they need to operate efficiently. The answer on which the community is converging today is to extract the lexicon from the text itself.
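As a rough illustration of extracting a lexicon from the text itself, with new entries added as each sentence is processed, the following Python sketch may help. It is not the system described in this paper: the names `tokenize` and `update_lexicon`, the entry layout, and the regular-expression tokenizer are all assumptions made for the example, and the part of speech and features are left empty because the rules that fill them are not reproduced here.

```python
import re

# Assumption: a token is a maximal run of Arabic letters (U+0621 to U+064A).
# Short-vowel marks lie outside this range, so the sketch expects
# unvocalized text such as newspaper articles.
ARABIC_TOKEN = re.compile(r"[\u0621-\u064A]+")


def tokenize(sentence):
    """Return the Arabic tokens of a sentence, in order of appearance."""
    return ARABIC_TOKEN.findall(sentence)


def update_lexicon(lexicon, sentence):
    """Add unseen tokens to the lexicon and return the newly added ones.

    Each entry is a hypothetical record: part of speech and features are
    placeholders that later modules would fill in.
    """
    new_tokens = []
    for token in tokenize(sentence):
        if token not in lexicon:
            lexicon[token] = {"pos": None, "features": {}, "count": 0}
            new_tokens.append(token)
        lexicon[token]["count"] += 1
    return new_tokens


lexicon = {}
# Two distinct words, the first one repeated at the end of the sentence.
sentence = "\u0643\u062A\u0628 \u0627\u0644\u0648\u0644\u062F \u0643\u062A\u0628"
added = update_lexicon(lexicon, sentence)
print(len(added), len(lexicon))
```

Calling `update_lexicon` once per sentence lets the dictionary grow continuously while a corpus is read, which mirrors the behaviour the abstract describes at a very coarse level.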
Much research on English and other languages currently concentrates on designing and building the lexicon automatically. One approach is to construct methods for automatically tagging the words in a document. The next step is to build rules for figuring out the features of those words automatically. Studying the lexicon of Arabic is particularly important because research on the Arabic language is unfortunately still not entirely up to date [1].

When we are dealing with newspaper articles that contain a huge amount of text, running to millions of words, we have a choice between two alternatives:

(1) Build a huge database of words manually and insert their features manually, which would take years and a large group of well-trained people with a solid education in how to analyze words and figure out their types and features.

(2) Build rules and algorithms for designing and building a lexicon automatically, so that the system can analyze the document, tag the words in it, and figure out their types and features automatically.

If we choose the second alternative when dealing with the Arabic language, we face two main challenges that are not present in English:

(1) These documents are full of proper nouns that need special rules to tag them in the text, because the Arabic language does not distinguish between lower and upper case letters. Upper case letters are a major help in marking proper nouns in English: they allow us to look for the capitalized letters in the text and start working around them. In Arabic there is no clear rule like this to guide us to the proper nouns, which leaves us with a big problem in recognizing them in Arabic text.

(2) Vowels are not written in the Arabic text we are using. (They are written only in books for children and in the holy Quran.) Different vowels change the meaning of the word.

978-1-4244-3397-1/08/$25.00 ©2008 IEEE
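The two challenges described above can be made concrete with a small Python sketch. It is only an illustration, not part of the system presented in this paper: the helper names `has_letter_case` and `strip_diacritics` are hypothetical, and the example words (the city name Sfax in Arabic script, and the vocalized forms kataba "he wrote" and kutub "books") were chosen for the demonstration.

```python
import unicodedata


def has_letter_case(text):
    """True if any character has distinct upper and lower case forms."""
    return any(ch.upper() != ch.lower() for ch in text)


def strip_diacritics(word):
    """Drop combining marks (Unicode category Mn).

    The Arabic short-vowel signs belong to this category, so the result
    is the unvocalized spelling. Note that this removes combining marks
    of any script, which is acceptable for this illustration.
    """
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


# Challenge 1: Arabic script has no letter case to signal proper nouns.
sfax_latin = "Sfax"
sfax_arabic = "\u0635\u0641\u0627\u0642\u0633"  # Sfax in Arabic script
print(has_letter_case(sfax_latin), has_letter_case(sfax_arabic))

# Challenge 2: different vocalized words share one unvocalized spelling.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
kutub = "\u0643\u064F\u062A\u064F\u0628"         # "books"
print(kataba != kutub)
print(strip_diacritics(kataba) == strip_diacritics(kutub))
```

Because both vocalized forms collapse to the same three letters once the vowel signs are absent, a lexicon built from unvocalized newspaper text has to resolve this ambiguity by other means, such as the surrounding context.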