International Journal of Innovative Technology and Exploring Engineering (IJITEE)
ISSN: 2278-3075, Volume-8, Issue-10, August 2019
Published By:
Blue Eyes Intelligence Engineering
& Sciences Publication
Retrieval Number J95200881019/19©BEIESP
DOI: 10.35940/ijitee.J9520.0881019
Abstract: With the advent of the digital era, billions of
documents are generated every day that need to be managed,
processed and classified. An enormous volume of text data is
available on the World Wide Web and from other sources. The first
step in managing this mammoth volume of data is classifying the
available documents into the right categories. Supervised machine
learning approaches try to solve the problem of document
classification, but working on large data sets with heterogeneous
classes is a big challenge. Automatic tagging and classification
of text documents is a useful task because of its many potential
applications, such as classifying emails as spam or non-spam, or
news articles as political, entertainment, stock market, sports
news, etc. This paper proposes a novel approach for classifying
text into known classes using an ensemble of refined Support
Vector Machines. The advantage of the proposed technique is that
it can considerably reduce the size of the training data by
adopting dimensionality reduction as a pre-training step. The
proposed technique has been applied to three benchmark data sets,
namely the CMU Dataset, the 20 Newsgroups Dataset, and the Classic
Dataset. Experimental results show that the proposed approach is
more accurate and efficient than other state-of-the-art methods.
Keywords: Text classification, support vector machine,
non-linear ensemble, machine learning, natural language
processing.
I. INTRODUCTION
With the advent of the digital era, billions of documents are
generated every day that need to be managed, processed and
classified. An enormous volume of text data is available on the
World Wide Web and from other sources. The first step in managing
this mammoth volume of data is classifying the available documents
into the right categories. Supervised machine learning approaches
try to solve the problem of document classification, but working
on large data sets with heterogeneous classes is a big challenge.
Automatic tagging and classification of text documents is a useful
task because of its many potential applications, such as
classifying emails as spam or non-spam, or news articles as
political, entertainment, stock market, sports news, etc. This
paper proposes a novel approach for classifying text into known
classes using an ensemble of refined Support Vector Machines. The
advantage of the proposed technique is that it can considerably
reduce the size of the training data by adopting dimensionality
reduction as a pre-training step. The proposed technique has been
applied to three benchmark data sets, namely the CMU Dataset, the
20 Newsgroups Dataset, and the Classic Dataset. Experimental
results show that the proposed approach is more accurate and
efficient than other state-of-the-art methods.
Revised Manuscript Received on August 05, 2019.
Dr Sheelesh Kumar Sharma, Professor (Comp. Sc.), IMS Ghaziabad,
India.
Mr Navel Kishor Sharma, Associate Dean, Academic City College
Ghana.
II. LITERATURE SURVEY
The area of text mining has been popular among
researchers for quite a long time. In the classic survey by
Berry [1], clustering, classification and retrieval of text data
are discussed along with various other concepts of text
mining. Hotho et al. [2] also present a survey of text mining
along with various pre-processing steps and algorithms.
Text classification has many interesting applications such as
content management, fraud detection in banking, sentiment
analysis, customer review and feedback analysis, search
engine optimization, biomedical analysis, etc. [3]-[7].
Text classification is a supervised learning task, and many
approaches have been used to perform it. Traditional
approaches found in the literature include the naive Bayes
classifier, k-nearest neighbors, artificial neural networks,
evolutionary approaches, support vector machines, decision
trees, etc. [8]-[11]. The classifier can be trained either on
extracted features or end-to-end, without a separate feature
extraction step. Given a huge volume of data, a
dimensionality reduction step can substantially reduce its
size. There are many approaches to dimensionality
reduction; two popular ones are linear discriminant
analysis (LDA) and principal component analysis (PCA).
Working with reduced data is generally more computationally
efficient than working with the entire data in raw form.
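As a minimal illustration of the PCA step mentioned above, the following Python sketch projects a small term-count matrix onto its top principal components using a singular value decomposition. The function name `pca_reduce`, the toy matrix and the choice of two components are assumptions made for illustration only; they are not taken from the proposed method.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    X = np.asarray(X, dtype=float)
    # Center each feature (column) so that PCA captures variance, not the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of the total variance captured by each retained component.
    explained = (s ** 2) / np.sum(s ** 2)
    return Xc @ components.T, components, explained[:n_components]


# Toy term-count matrix: 6 documents x 5 terms (made-up numbers).
docs = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 1, 0],
    [4, 1, 1, 0, 0],
    [0, 0, 3, 2, 1],
    [0, 1, 2, 3, 0],
    [1, 0, 4, 2, 0],
])

reduced, components, ratio = pca_reduce(docs, n_components=2)
print(reduced.shape)  # each document is now described by 2 values instead of 5
print(ratio)          # share of variance kept by each component
```

A classifier trained on `reduced` sees fewer features per document than one trained on `docs`, which is where the computational saving comes from. Real document-term matrices are large and sparse, so a sparse-aware decomposition would usually be preferred over the dense SVD used in this sketch.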
Deep learning based methods have been very effective,
especially in visual and textual pattern recognition
tasks [12]-[13]. These approaches can either be feature based
or use end-to-end learning without the need for feature
extraction. The end-to-end learning variant of deep learning
is highly popular. One limitation of deep learning based
methods is that they require large amounts of data and
computational resources. There are many deep learning models.
The most popular model is the convolutional neural network
(CNN). To overcome the shortcomings of CNNs, some advanced
models exist, such as the recurrent neural network (RNN) and
the long short-term memory (LSTM) network, a variant of
RNN [12]-[13]. Still, there is one
Text Classification Using Ensemble Of
Non-Linear Support Vector Machines
Sheelesh Kumar Sharma, Navel Kishor Sharma