A NOVEL MODEL FOR TEXT DOCUMENT REPRESENTATION: APPLICATION ON
OPINION MINING DATASETS
ASMAA MOUNTASSIR, HOUDA BENBRAHIM & ILHAM BERRADA
ALBIRONI Research Team, ENSIAS, Mohamed 5 University, Souissi, Rabat, Morocco
ABSTRACT
In this paper, we propose a novel model for Document Representation in an attempt to address the problems of
high dimensionality and vector sparseness that are commonly faced in Text Classification tasks. The proposed model
consists of representing text documents in the space of training documents. To evaluate the effectiveness of our model, we
focus on a problem of binary classification. We conduct our experiments on Arabic and English data sets of Opinion
Mining. As classifiers, we use Support Vector Machines (SVM) and k-Nearest Neighbors (kNN). We compare the
performance of our model with that of the classical Vector Space Model (VSM) on three evaluation
criteria, namely the dimensionality of the generated vectors, the time taken by the classifiers, and classification
accuracy. Our experiments show that the effectiveness of our model depends on the classifier used. Results yielded by
kNN when applying our model are the same as those obtained when applying the classical VSM. For SVM, results yielded
when applying our model are, in general, slightly lower than those obtained when using VSM. However, the gains in
time and dimensionality are promising, since both are dramatically reduced by the application of our model.
KEYWORDS: Document Representation, Text Classification, Opinion Mining, Machine Learning, Natural Language
Processing
INTRODUCTION
With the increasing amount of available text documents in digital forms (either on the web or in databases), the
need to automatically organize and classify these documents becomes more important and, at the same time, more
challenging. Text Classification techniques are used in a wide range of domains, including
Categorization by Topic (Saad & Ashour, 2010), Opinion Mining (Mountassir et al., 2012a), Recommendation
Systems (Li et al., 2010), Question Answering (Yu & Hatzivassiloglou, 2003) and Spam Detection (Prilepok et al., 2012).
Automated Text Classification (TC) is a supervised learning task that consists of assigning some pre-defined
category labels to new documents (called test documents) on the basis of the likelihood suggested by labeled documents
(called training documents). A growing number of machine learning methods are applied to this problem, including Naïve
Bayes, Decision Trees, Support Vector Machines, and k-Nearest Neighbors (Sebastiani, 2002).
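To make the supervised setting concrete, the following is a minimal Python sketch of a k-Nearest Neighbors classifier that labels a test document from labeled training documents. It assumes that documents have already been turned into numeric vectors, and it uses cosine similarity with majority voting. The toy vectors, the labels, the value of k, and the function names `cosine` and `knn_predict` are illustrative assumptions and not the experimental setup of this paper.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction, so treat its similarity as 0.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(test_vec, train_vecs, train_labels, k=3):
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(test_vec, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training documents (hypothetical vectors) with binary opinion labels.
train_vecs = [[2, 0, 1], [1, 0, 0], [0, 2, 1], [0, 1, 0]]
train_labels = ["pos", "pos", "neg", "neg"]

# A new (test) document is assigned the label suggested by its neighbors.
predicted = knn_predict([1, 0, 1], train_vecs, train_labels, k=3)
```

In this sketch the training step only stores the labeled vectors. All the work happens at prediction time, which is one reason why vector dimensionality directly affects the time taken by kNN.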
As text documents cannot be directly interpreted by such learning algorithms, we need to represent these
documents using the Vector Space Model (VSM) (Salton, 1989). VSM consists of generating, for each document, its
corresponding feature vector. Given a feature set and a document, the generated vector assigns to each feature its
weight with respect to the document. Note that a feature weight measures how important the feature is with respect to the
document. At this stage, two issues are to be addressed. The first issue is how to build the feature set. The second issue is
how to weight these features.
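As a minimal sketch of the representation just described, the following Python snippet builds a feature set from a toy collection and generates one vector per document. The whitespace tokenizer, the use of raw term counts as weights, the example documents, and the names `build_feature_set` and `to_vector` are illustrative assumptions, not choices made in this paper.

```python
from collections import Counter


def build_feature_set(documents):
    """Return the sorted list of distinct lowercase tokens in the collection."""
    return sorted({tok for doc in documents for tok in doc.lower().split()})


def to_vector(document, feature_set):
    """Weight each feature by its raw count in the document (assumed scheme)."""
    counts = Counter(document.lower().split())
    # Features absent from the document get weight 0, hence sparse vectors.
    return [counts[feature] for feature in feature_set]


# Hypothetical toy collection of three short opinion documents.
train_docs = ["good movie", "bad movie", "good good plot"]

features = build_feature_set(train_docs)
vectors = [to_vector(doc, features) for doc in train_docs]

for doc, vec in zip(train_docs, vectors):
    print(doc, vec)
```

Each vector has one component per feature, so its length grows with the vocabulary of the collection and most components are zero for any single document. These are the dimensionality and sparseness problems mentioned in the abstract.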
One conventional way to construct the feature set is to consider documents as “bags-of-words”, where the features
International Journal of Computer Science Engineering
and Information Technology Research (IJCSEITR)
ISSN 2249-6831
Vol. 3, Issue 3, Aug 2013, 293-304
© TJPRC Pvt. Ltd.