International Journal of Computer Applications (0975 – 8887)
Volume 113 – No. 10, March 2015

Feature Selection for Effective Text Classification using Semantic Information

Rajul Jain
PG Student, Department of Computer Engineering, Maharashtra Institute of Technology Pune, Pune, Maharashtra, India

Nitin Pise
Associate Professor, Department of Computer Engineering, Maharashtra Institute of Technology Pune, Pune, Maharashtra, India

ABSTRACT
Text categorization is the task of assigning texts or documents to pre-specified classes or categories. To classify documents well, text-based learning needs to take context into account: humans decide the relevance of a text from the context associated with it, so machine learning should likewise incorporate context information alongside the text to improve classification accuracy. This can be achieved by using semantic information, such as part-of-speech tags, associated with the text. The aim of this experimentation is therefore to use this semantic information to select features that may provide better classification results. Different datasets are constructed, each from a different collection of features, to understand which representation of text data works best for different types of classifiers.

General Terms
Text Classification

Keywords
Context, POS tagging, semantic information, text categorization

1. INTRODUCTION
The explosive growth of the internet has led to a great deal of interest in developing useful and efficient tools and software to assist users in searching the Web. Most of the information available on the internet is in the form of text, so it is imperative to be able to deal with text data. Text mining generally refers to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text categorization is a crucial research field within text mining.
The main objective of text categorization is to recognize, understand and organize large volumes of text data or documents. The main difficulties are the complexity of natural languages and the extremely high dimensionality of the feature space of documents, both of which complicate the classification problem. Machine learning therefore has to meet two requirements. First, it needs an efficient data representation to store and process the massive amount of data, together with an efficient learning algorithm to solve the problem. Second, the learned model must be accurate and efficient when classifying unseen documents. The main advantages of this approach over the knowledge engineering approach (in which domain experts define a classifier manually) are very good effectiveness, significant savings in expert manpower, and easy generalization (i.e. easy portability to different domains) [1].

The process of text categorization can be broadly understood through the steps shown in Figure 1. The document set first needs to be converted to a representation suitable for classification, which requires a sequence of steps discussed in detail in the literature survey.

Figure 1: The process of text categorization

After this step the classifier can be trained and later evaluated on unseen data samples. The main issues thus concern three different problems, viz. data representation, classifier training and classifier performance evaluation. These tasks form the main phases of the life cycle of a text classification system and are discussed briefly below.

2. LITERATURE SURVEY
A number of experiments have been performed to tackle the issues in text categorization.
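As a rough illustration of the three phases named in the introduction (data representation, classifier training and classifier performance evaluation), the following minimal Python sketch may help. It assumes a binary bag-of-words representation and a Bernoulli naive Bayes classifier with add-one smoothing. The choice of classifier, all function names and the toy documents are assumptions made purely for illustration and are not taken from the experiments described in this paper.

```python
import math
from collections import defaultdict

def to_binary_vector(text, vocabulary):
    """Data representation: 1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

def train(vectors, labels):
    """Classifier training: Bernoulli naive Bayes with add-one smoothing."""
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(vectors)
        # P(word present | class), smoothed so no probability is exactly 0 or 1
        probs = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[label] = (prior, probs)
    return model

def predict(model, vec):
    """Return the class with the highest log posterior score for one vector."""
    def log_score(prior, probs):
        return math.log(prior) + sum(
            math.log(p if x else 1 - p) for x, p in zip(vec, probs))
    return max(model, key=lambda label: log_score(*model[label]))

def accuracy(model, vectors, labels):
    """Performance evaluation: fraction of unseen documents classified correctly."""
    hits = sum(predict(model, v) == y for v, y in zip(vectors, labels))
    return hits / len(labels)

# Toy documents, invented for illustration only.
train_docs = [("the team won the match", "sports"),
              ("a late goal decided the match", "sports"),
              ("the market fell as stocks dropped", "finance"),
              ("investors sold stocks and bonds", "finance")]
test_docs = [("the team scored a goal", "sports"),
             ("stocks rose in the market", "finance")]

vocabulary = sorted({w for text, _ in train_docs for w in text.lower().split()})
X_train = [to_binary_vector(t, vocabulary) for t, _ in train_docs]
y_train = [y for _, y in train_docs]
X_test = [to_binary_vector(t, vocabulary) for t, _ in test_docs]
y_test = [y for _, y in test_docs]

model = train(X_train, y_train)
test_accuracy = accuracy(model, X_test, y_test)
print(test_accuracy)
```

Words in the test documents that never appeared in training (such as "scored") are simply ignored here, because the vocabulary is fixed from the training set. With a realistic corpus the vocabulary grows very large, which is the high-dimensionality problem that motivates feature selection.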
This section outlines the subtasks involved in the process of text categorization, along with related experiments reported by other researchers:

2.1 Document Preprocessing
A document by itself is just a collection of words, so it first needs to be preprocessed and converted into a form that a classifier-generating algorithm can use as a dataset. A document or text is therefore usually represented by an array of words called the feature set. A document can then be represented by a binary vector, assigning the value 1 if the document