International Journal of Scientific Engineering and Research (IJSER) www.ijser.in ISSN (Online): 2347-3878, Impact Factor (2014): 3.05 Volume 3 Issue 6, June 2015 Licensed Under Creative Commons Attribution CC BY

Content Based Text Classification Using Markov Models

Khalid Hussain Zargar 1, Manzoor Ahmad Chachoo 2

1 Department of Computer Science, Mewar University, Gangrar Chitorgarh, Rajasthan-312901
2 Department of Computer Science, University of Kashmir, Hazratbal Srinagar-190001

Abstract: Text categorization is the task of assigning predefined categories to a set of documents. Several models, such as SVM, Naïve Bayes, and KNN, have been used in the past. In this paper we present another approach to automatically assigning a category to a document, based on the use of Markov models. We treat text as a bag of words and use a Hidden Markov Model to assign the most appropriate category to the text. The proposed approach rests on the observation that, while creating documents, users draw on the specific vocabulary of a particular category. Hidden Markov models have been widely used in automatic speech recognition, part-of-speech tagging, and information extraction, but have not been used extensively for text categorization.

Keywords: Text Classification, Information Gain, HMM, Text Processing, Viterbi Algorithm, Precision, Recall

1. Introduction

Text classification has recently become an active research area in library science, computer science, and many other fields. Nowadays, with large volumes of digitized text material, manual classification has become almost impractical, consuming considerable time and resources. New, automatic approaches to text classification are therefore needed.
In the recent past, various methods of automatically classifying text have been developed using machine learning algorithms such as Naïve Bayes, artificial neural networks, KNN, and SVM. A machine learning algorithm takes as input a set of labeled example documents (where the label indicates which category the example belongs to) and attempts to infer a function that will map new documents to their categories. In this paper we describe a process of automatic text classification based on a Hidden Markov Model. "A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states." This approach to text categorization uses the same bag-of-words representation of documents. HMMs have been used in many text- and speech-related applications such as information retrieval, information extraction, and text summarization, but have not been widely used for text classification. The purpose of this approach is to consider only the content of the document, and not its structure, for classification.

2. Hidden Markov Model

"A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states." An HMM can be defined as a 5-tuple (S, V, π, A, B).

S: Number of states in the model (the categories in our case). There is a finite set of states in a model. The states in an HMM are hidden, but they carry much of the significance in defining the model. We denote the individual states as S1, S2, S3, ..., SN, so S = {S1, S2, S3, ..., SN}.

V: Number of distinct symbols observable in the states. These symbols correspond to the observable output of the system being modeled. We denote the individual symbols as v1, v2, v3, ..., vM, so V = {v1, v2, v3, ..., vM}.

A: State transition probability distribution. A is the transition array that stores the state transition probabilities.
A = {a_ij}, where a_ij stores the probability of state S_j following state S_i:

a_ij = P(q_t = S_j | q_(t-1) = S_i), 1 ≤ i, j ≤ N,

the probability of moving from state S_i to state S_j at time t. At each time t, a new state is entered, depending on the transition probability distribution of the state at time t-1. Transition to the same state is also possible. An important point about transition probabilities is that they are independent of time: the probability of moving from state S_i to state S_j does not vary with t.

B: Observation symbol probability distribution. B = {b_j(k)} is the output symbol array that stores the probability of an observation v_k being produced from state S_j, independent of time t:

b_j(k) = P(x_t = v_k | q_t = S_j), 1 ≤ j ≤ N and 1 ≤ k ≤ M,

the probability of emitting symbol v_k when state S_j is entered at time t. These output emission probabilities are likewise independent of time: the probability of a state emitting a particular output symbol does not vary with t. After each transition is made, a symbol is
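The parameters defined above can be illustrated with a small, self-contained sketch in Python/NumPy. The category names, vocabulary, and probability values below are purely illustrative (not taken from the paper), and the decoding routine is a minimal textbook implementation of the Viterbi algorithm named in the keywords:

```python
import numpy as np

# Hidden states S (the categories; hidden in the model) -- illustrative names
S = ["sports", "politics"]
# Observable symbols V (the vocabulary, i.e. the bag of words) -- illustrative
V = ["ball", "vote", "team"]

# pi: initial state probability distribution
pi = np.array([0.5, 0.5])

# A: state transition probabilities, a_ij = P(q_t = S_j | q_(t-1) = S_i)
A = np.array([[0.8, 0.2],
              [0.3, 0.7]])

# B: emission probabilities, b_j(k) = P(x_t = v_k | q_t = S_j)
B = np.array([[0.6, 0.1, 0.3],
              [0.2, 0.7, 0.1]])

# pi and each row of A and B are probability distributions (sum to 1)
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)

def viterbi(obs, pi, A, B):
    """Most likely hidden-state sequence for a sequence of symbol indices."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    # Backtrack from the most probable final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

obs = [0, 2, 0]  # "ball", "team", "ball"
print([S[s] for s in viterbi(obs, pi, A, B)])  # → ['sports', 'sports', 'sports']
```

In an actual classifier, π, A, and B would be estimated from labeled training documents rather than fixed by hand; the hard-coded values here only show the shapes of the arrays and the row-stochastic constraints.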