ISSN: 2277-9655
[Tilve* et al., 6(2): February, 2017] Impact Factor: 4.116
IC™ Value: 3.00 CODEN: IJESS7
http://www.ijesrt.com © International Journal of Engineering Sciences & Research Technology
IJESRT
INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH
TECHNOLOGY
A SURVEY ON MACHINE LEARNING TECHNIQUES FOR TEXT
CLASSIFICATION
Amey K. Shet Tilve*, Surabhi N. Jain
* Department of Computer Engineering, Don Bosco College of Engineering, Margao, India
Creative Capsule InfoTech, Verna, India
DOI: 10.5281/zenodo.322477
ABSTRACT
This research focuses on text classification, the task of automatically sorting a set of documents into categories
from a predefined set. The domain of this research combines information retrieval (IR), data mining, and machine
learning (ML) technology. This research outlines the fundamental traits of the technologies involved. It uses three
text classification algorithms (Naive Bayes, the Vector Space Model (VSM) for text classification, and a new
technique, the use of the Stanford Tagger for text classification) to classify documents into different categories.
The algorithms are trained on two different datasets (20 Newsgroups and a new news dataset for five categories).
Among these classification strategies, Naive Bayes is potentially well suited to serve as a text classification
model because of its simplicity.
KEYWORDS: Text Classification, Information Retrieval, Naive Bayes Classifier, Vector Space Model Text
Classification, Part of Speech Tagging, Natural Language Processing.
INTRODUCTION
Text mining studies have recently been gaining importance because of the increasing number of electronic documents
available from a variety of sources, which include unstructured and semi-structured information. The main goal of
text mining is to enable users to extract information from textual resources; it deals with operations such as
retrieval, classification (supervised, unsupervised, and semi-supervised), and summarization. Natural Language
Processing (NLP), data mining, and machine learning techniques work together to automatically classify documents
and discover patterns in different types of documents.
Text classification (TC) is an important part of text mining. One approach to it has been to build automatic TC
systems manually by means of knowledge-engineering techniques, i.e., by manually defining a set of logical rules
that encode expert knowledge on how to classify documents under the given set of categories. An example would be
to automatically label each incoming news item with a topic such as "sports", "politics", or "business".
A data mining classification task starts with a training set D = (d1, ..., dn) of documents that are already
labeled with a class C1, C2, ... (e.g. sport, politics). The task is then to determine a classification model that
is able to assign the correct class to a new document d of the domain.
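The task formulation above can be illustrated with a minimal sketch. It assumes a multinomial Naive Bayes model with add-one (Laplace) smoothing and whitespace tokenization; the function names (`tokenize`, `train_naive_bayes`, `classify`) and the four toy documents are hypothetical and are not taken from the experiments in this paper.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase whitespace tokenization; a real system would pre-process further.
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Build a model from the training set D, given as (text, label) pairs."""
    class_doc_counts = Counter()        # number of training documents per class
    word_counts = defaultdict(Counter)  # per-class token frequencies
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_doc_counts, word_counts, vocabulary


def classify(model, text):
    """Assign the most probable class to a new document d."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in class_doc_counts.items():
        # Log prior of the class plus log likelihood of each known token.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # tokens never seen in training are ignored
            # Add-one smoothing avoids zero probabilities.
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set D with two classes.
training_set = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the new bill", "politics"),
    ("the minister won the election", "politics"),
]
model = train_naive_bayes(training_set)
predicted = classify(model, "the team scored a goal")
print(predicted)
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents grow beyond a few tokens.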
There are basically two stages involved in text classification: a training stage and a testing stage. As explained
in the paragraph above, in the training stage documents are pre-processed and passed to a learning algorithm, which
generates the classifier. In the testing stage, the classifier is validated. There are many traditional learning
algorithms for training on the data, such as decision trees, Naive Bayes (NB), Support Vector Machines (SVM),
k-Nearest Neighbor (kNN), and Neural Networks (NNet).
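The two stages can be sketched as follows. This is an illustrative outline only: it assumes raw term-frequency vectors compared by cosine similarity and a 1-nearest-neighbor decision rule, chosen for brevity rather than because it is the procedure used in this study. The names `vectorize`, `cosine`, `train`, `predict`, and `evaluate`, as well as the toy documents, are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    # Pre-processing: lowercase, split on whitespace, count term frequencies.
    return Counter(text.lower().split())


def cosine(u, v):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train(labeled_docs):
    # Training stage: pre-process each labeled document and keep its vector and label.
    return [(vectorize(text), label) for text, label in labeled_docs]


def predict(classifier, text):
    # 1-nearest-neighbor rule: return the label of the most similar training document.
    query = vectorize(text)
    _, label = max(classifier, key=lambda item: cosine(query, item[0]))
    return label


def evaluate(classifier, test_docs):
    # Testing stage: fraction of held-out documents that receive the correct label.
    correct = sum(predict(classifier, text) == label for text, label in test_docs)
    return correct / len(test_docs)


# Hypothetical toy data; the test documents are not seen during training.
train_docs = [
    ("stocks fell as markets closed lower", "business"),
    ("the company reported record profits", "business"),
    ("the team won the championship game", "sports"),
    ("the coach praised the young players", "sports"),
]
test_docs = [
    ("markets closed higher as stocks rallied", "business"),
    ("young players won the game", "sports"),
]
classifier = train(train_docs)
accuracy = evaluate(classifier, test_docs)
print(accuracy)
```

Keeping the test documents separate from the training documents is what makes the reported accuracy an estimate of performance on unseen text.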
In this research, we study the problem of text classification, that is, classifying news documents into different
categories using three supervised algorithms: the Naive Bayes classifier, the Vector Space Model for text
classification, and a new technique, the use of the Stanford Tagger for text classification. We compare the
efficiency and accuracy of the algorithms to analyze the effectiveness of each. The research has been carried out
on two datasets, namely 20 Newsgroups and a new dataset of news for five categories.