International Journal of Innovative Technology and Exploring Engineering (IJITEE)
ISSN: 2278-3075, Volume-8, Issue-10, August 2019
Published By: Blue Eyes Intelligence Engineering & Sciences Publication
Retrieval Number: J95200881019/19©BEIESP
DOI: 10.35940/ijitee.J9520.0881019

Abstract: With the advent of the digital era, billions of documents are generated every day that need to be managed, processed and classified. An enormous volume of text data is available on the World Wide Web and other sources. The first step in managing this mammoth volume of data is classifying the available documents into the right categories. Supervised machine learning approaches try to solve the problem of document classification, but working with large data sets of heterogeneous classes is a big challenge. Automatic tagging and classification of text documents is a useful task because of its many potential applications, such as classifying emails as spam or non-spam and news articles as political, entertainment, stock market, sports news, etc. The paper proposes a novel approach for classifying text into known classes using an ensemble of refined Support Vector Machines. The advantage of the proposed technique is that it can considerably reduce the size of the training data by adopting dimensionality reduction as a pre-training step. The proposed technique has been applied to three benchmark data sets, namely the CMU Dataset, the 20 Newsgroups Dataset, and the Classic Dataset. Experimental results show that the proposed approach is more accurate and efficient than other state-of-the-art methods.

Keywords: Text classification, support vector machine, non-linear ensemble, machine learning, natural language processing.

I. INTRODUCTION

With the advent of the digital era, billions of documents are generated every day that need to be managed, processed and classified. An enormous volume of text data is available on the World Wide Web and other sources.
The first step in managing this mammoth volume of data is classifying the available documents into the right categories. Supervised machine learning approaches try to solve the problem of document classification, but working with large data sets of heterogeneous classes is a big challenge. Automatic tagging and classification of text documents is a useful task because of its many potential applications, such as classifying emails as spam or non-spam and news articles as political, entertainment, stock market, sports news, etc. The paper proposes a novel approach for classifying text into known classes using an ensemble of refined Support Vector Machines. The advantage of the proposed technique is that it can considerably reduce the size of the training data by adopting dimensionality reduction as a pre-training step. The proposed technique has been applied to three benchmark data sets, namely the CMU Dataset, the 20 Newsgroups Dataset, and the Classic Dataset. Experimental results show that the proposed approach is more accurate and efficient than other state-of-the-art methods.

Revised Manuscript Received on August 05, 2019.
Dr Sheelesh Kumar Sharma, Professor (Comp. Sc.), IMS Ghaziabad, India.
Mr Navel Kishor Sharma, Associate Dean, Academic City College, Ghana.

II. LITERATURE SURVEY

Text mining has been a popular research area for a long time. The classic survey by Berry [1] discusses clustering, classification and retrieval of text data, along with various other text mining concepts. Hotho et al. [2] also present a survey of text mining that covers various pre-processing steps and algorithms. Text classification has many interesting applications, such as content management, fraud detection in banking, sentiment analysis, analysis of customer reviews and feedback, search engine optimization, biomedical analysis, etc. [3]-[7].

Text classification is a supervised learning task, and many approaches have been used to perform it.
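To make the notion of text classification as a supervised learning task concrete, the following is a minimal sketch in which a classifier is fitted on labeled documents and then asked to label unseen text. It uses a naive Bayes classifier with Laplace smoothing purely because it fits in a few lines of standard-library Python; it is not the SVM ensemble proposed in this paper. The toy documents, the labels and the function names (`tokenize`, `train`, `predict`) are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior under Laplace smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for illustration only.
TRAIN_DOCS = [
    ("win a free prize now", "spam"),
    ("free offer claim your prize", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("please review the project report", "non-spam"),
]

model = train(TRAIN_DOCS)
print(predict(model, "claim your free prize"))
print(predict(model, "project meeting on monday"))
```

On real corpora such as 20 Newsgroups, the crude tokenizer and raw counts would normally be replaced by a proper feature extraction step, and the classifier by whichever model is under study.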
Traditionally, the approaches found in the literature for text classification include the naive Bayes classifier, k-nearest neighbors, artificial neural networks, evolutionary approaches, support vector machines, decision trees, etc. [8]-[11]. The classifier can be trained either on extracted features or end-to-end, without a separate feature extraction step.

Given the huge volume of data, a dimensionality reduction step can substantially reduce its size. There are many approaches to dimensionality reduction; two popular ones are linear discriminant analysis (LDA) and principal component analysis (PCA). Working with reduced data is generally more computationally efficient than working with the entire data in raw form.

Deep learning based methods have been very effective, especially in visual and textual pattern recognition tasks [12]-[13]. These approaches can either be feature based or use end-to-end learning without the need for feature extraction. The end-to-end learning variant of deep learning is highly popular. One limitation of deep learning based methods is that they require large amounts of data and computational resources. There are many deep learning models, of which the convolutional neural network (CNN) is the most popular. To overcome the shortcomings of CNNs, some advanced models exist, such as the recurrent neural network (RNN) and the long short-term memory (LSTM) network, a variant of the RNN [12]-[13].

Text Classification Using Ensemble Of Non-Linear Support Vector Machines
Sheelesh Kumar Sharma, Navel Kishor Sharma

Still, there is one