Arabic supervised learning method using N-gram

Majed Sanan
Paris 8 University, Paris, France
Mahmoud Rammal
Lebanese University, Beirut, Lebanon, and
Khaldoun Zreik
Paris 8 University, Paris, France

Interactive Technology and Smart Education, Vol. 5 No. 3, 2008, pp. 157-169
© Emerald Group Publishing Limited, 1741-5659
DOI 10.1108/17415650810908249
The current issue and full text archive of this journal is available at www.emeraldinsight.com/1741-5659.htm

Abstract
Purpose – Classifying Arabic documents is a real problem for juridical centers. In the case studied here, some of the Lebanese official journal documents are already classified, and the center has to classify new documents based on them. This paper aims to study and explain the application of a supervised learning method to Arabic texts using N-grams as the indexing method (n = 3).
Design/methodology/approach – The Lebanese official journal documents are categorized into several classes. Supposing that the class(es) of some documents (called learning texts) are known, segmenting these documents helps to determine the candidate words of each class.
Findings – Results showed that N-gram text classification using the cosine coefficient measure outperforms classification using Dice's measure and the TF*ICF weight. It is the best of the three measures, but it is still insufficient. The N-gram method is good but still insufficient for the classification of Arabic documents, so future work needs to consider a new approach, such as a distributional or symbolic approach, in order to increase effectiveness.
Originality/value – The results could be used to improve Arabic document classification (including in software). This work has evaluated a number of similarity measures for the classification of Arabic documents, using the Lebanese parliament documents, and especially the Lebanese official journal documents Arabic corpus, as the test bed.
Keywords Classification, Learning methods, Languages, Text retrieval, Lebanon
Paper type Research paper

1. Introduction
The rapid growth of the internet has increased the number of online documents available. This has led to the development of automated text and document classification systems that are capable of automatically organizing and classifying documents. Text classification (or categorization) is the process of structuring a set of documents according to a group structure that is known in advance. There are several different methods for text classification, including statistical-based algorithms, Bayesian classification, distance-based algorithms, k-nearest neighbors, decision tree-based methods, etc.

Text classification techniques are used in many applications, including e-mail filtering, mail routing, spam filtering, news monitoring, sorting through digitized paper archives, automated indexing of scientific articles, classification of news stories, and searching for interesting information on the web.

The majority of these systems are designed to handle documents written in languages other than Arabic. Developing text classification systems for Arabic documents is a challenging task due to the complex and rich nature of the Arabic language. The Arabic language consists of 28 letters and is written from right to left. It has a very complex morphology, and the majority of words have a tri-letter root. The rest have either a quad-letter root, penta-letter root, or hexa-letter root.
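
To make the indexing scheme named in the abstract more concrete, the following is a minimal sketch of character trigram extraction (n = 3) together with two of the similarity measures mentioned there, the cosine coefficient and Dice's measure. It is an illustration only and not the implementation evaluated in this paper: the function names (`char_ngrams`, `cosine`, `dice`) are hypothetical, n-grams are assumed to be taken inside whitespace-separated words of undiacritized text, Dice's measure is assumed to be computed on sets of distinct n-grams, and the TF*ICF weight is not covered.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams inside each whitespace-separated word.

    Assumption: n-grams do not cross word boundaries, and words shorter
    than n characters contribute nothing.
    """
    counts = Counter()
    for word in text.split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine coefficient between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def dice(a: Counter, b: Counter) -> float:
    """Dice's measure on the sets of distinct n-grams (frequencies ignored)."""
    set_a, set_b = set(a), set(b)
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical toy strings, not taken from the Lebanese official journal corpus.
    document = char_ngrams("القانون اللبناني")
    class_profile = char_ngrams("قانون لبنان")
    print(cosine(document, class_profile), dice(document, class_profile))
```

Under these assumptions, a new document would be assigned to the class whose profile, built from the learning texts, yields the highest similarity score. Working on character trigrams rather than whole words is one way to match inflected Arabic forms that share part of a root without a morphological analyzer.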