A NOVEL MODEL FOR TEXT DOCUMENT REPRESENTATION: APPLICATION ON OPINION MINING DATASETS

ASMAA MOUNTASSIR, HOUDA BENBRAHIM & ILHAM BERRADA

ALBIRONI Research Team, ENSIAS, Mohamed 5 University, Souissi, Rabat, Morocco

ABSTRACT

In this paper, we propose a novel model for Document Representation in an attempt to address the problems of high dimensionality and vector sparseness that are commonly faced in Text Classification tasks. The proposed model represents text documents in the space of training documents. To evaluate the effectiveness of our model, we focus on a binary classification problem. We conduct our experiments on Arabic and English Opinion Mining data sets. As classifiers, we use Support Vector Machines (SVM) and k-Nearest Neighbors (kNN). We compare the performance of our model with that of the classical Vector Space Model (VSM) on three evaluation criteria: the dimensionality of the generated vectors, the time taken by the classifiers, and the classification results in terms of accuracy. Our experiments show that the effectiveness of our model depends on the classifier used. Results yielded by kNN when applying our model are the same as those obtained when applying the classical VSM. For SVM, results yielded when applying our model are, in general, slightly lower than those obtained when using VSM. However, the gains in time and dimensionality are promising, since both are dramatically reduced by the application of our model.

KEYWORDS: Document Representation, Text Classification, Opinion Mining, Machine Learning, Natural Language Processing

INTRODUCTION

With the increasing amount of text documents available in digital form (either on the web or in databases), the need to automatically organize and classify these documents becomes more important and, at the same time, more challenging. Text Classification techniques are used in a wide range of domains.
Among these domains we find Categorization by Topic (Saad & Ashour, 2010), Opinion Mining (Mountassir et al., 2012a), Recommendation Systems (Li et al., 2010), Question Answering (Yu & Hatzivassiloglou, 2003) and Spam Detection (Prilepok et al., 2012).

Automated Text Classification (TC) is a supervised learning task that consists of assigning pre-defined category labels to new documents (called test documents) on the basis of the likelihood suggested by labeled documents (called training documents). A growing number of machine learning methods are applied to this problem, including Naïve Bayes, Decision Trees, Support Vector Machines, and k-Nearest Neighbors (Sebastiani, 2002).

As text documents cannot be directly interpreted by such learning algorithms, we need to represent these documents using the Vector Space Model (VSM) (Salton, 1989). VSM consists of generating, for each document, its corresponding feature vector. Given a feature set and a document, the generated vector assigns each feature its weight with respect to that document. Note that a feature weight measures how important the feature is to the document.

International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) ISSN 2249-6831 Vol. 3, Issue 3, Aug 2013, 293-304 © TJPRC Pvt. Ltd.

At this stage, two issues are to be addressed. The first issue is how to build the feature set. The second issue is how to weight these features. One conventional way to construct the feature set is to consider documents as “bags-of-words”, where the features