International Journal of Computer Applications Technology and Research
Volume 7–Issue 02, 45-52, 2018, ISSN:-2319–8656
www.ijcat.com

A Hybrid Approach of Association Rule & Hidden Markov Model to Improve Efficiency Medical Text Classification

Huda Ali Al-qozani
Department of Computer Science
University of Thamar
Thamar, Yemen

Khalil saeed Al-wagih
Department of Information Technology
University of Thamar
Thamar, Yemen

Abstract: Text classification is the task of assigning a set of documents to a predefined set of categories, where each document is classified based on a set of features (words). However, some words are not relevant to a category, which causes a gap in word relevance within a document. With the large number of research articles in public databases and the digitization of critical medical information such as lab reports, patient records, research papers, and anatomic images, tremendous amounts of biomedical research data are generated every day. Consequently, classifying this data and retrieving information relevant to users' needs has been a primary research issue in the field of Information Retrieval, and classification has been adopted to tackle this particular problem. In this paper, we propose a hybrid model for the classification of biomedical texts according to their content, using Association Rules and a Hidden Markov Model as the classifier. To demonstrate it, we present a set of experiments performed on the OHSUMED biomedical text corpus. Our classifier is compared with Naive Bayes and Support Vector Machine models. The evaluation results show that the proposed classifier is complete and accurate when compared with the Naive Bayes and Support Vector Machine models.

Keywords: Hidden Markov Model, Association Rules, Biomedical Text, Text Classification, Machine learning, Text mining, Information Retrieval.

1. INTRODUCTION

The field of biomedical informatics has drawn increasing attention and has been growing rapidly.
The amount of biomedical research data generated every day in public databases such as OHSUMED and elsewhere has led to a growing realization that such data contains knowledge buried within it: knowledge that could lead to important discoveries in science, knowledge that could enable us to predict diseases accurately, knowledge that could enable us to identify the causes of and possible cures for lethal illnesses, knowledge that could literally mean the difference between life and death. It has rightly been said that the world is becoming 'data rich but knowledge poor'. These data need to be effectively organized and analyzed in order to be useful [18]. On the other hand, knowledge management practices often need to leverage existing clinical decision support, information retrieval (IR), and digital library techniques to capture and deliver tacit and explicit biomedical knowledge. Text mining techniques have been used to analyze research publications as well as electronic patient records [9].

The task of automatic classification is a relatively new IR subfield. Since Machine Learning (ML) serves as a theoretical foundation for the methodologies in this task, its scope is often referred to as the intersection of IR and ML [46]. Text classification (TC) may be formalized as the task of approximating the unknown target function $f : D \times C \to \{-1, 1\}$ that corresponds to how documents would be classified. The function $f$ is the text classifier, $C = \{c_1, c_2, \ldots, c_j, \ldots, c_{|C|}\}$ is a predefined set of categories, and $D$ is a set of documents. Each document is represented using the set of features, usually words, $W = \{w_1, w_2, \ldots, w_k, \ldots, w_{|W|}\}$, with each document as a vector $d_i = \{w_{i1}, w_{i2}, \ldots, w_{ik}, \ldots, w_{i|W|}\}$, where $w_{ik}$ describes each feature's representation for that specific document. When $f(d_i, c_j) = 1$, $d_i$ is a positive example or member of category $c_j$, whilst when $f(d_i, c_j) = -1$ it is a negative example of $c_j$.
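As an illustration of this formalization, the following minimal Python sketch builds the feature set W from a toy corpus, represents each document as a vector over W, and evaluates the target function f for every document and category pair. The toy documents, the category names, and the helper names build_vocabulary, vectorize, and target are hypothetical and are not taken from the study. Raw term counts are assumed for the feature values w_ik, and negative examples are assumed to be coded as -1 so that f takes values in {-1, 1}.

```python
from collections import Counter

# Hypothetical toy corpus: each entry is (document text, set of true categories).
DOCS = [
    ("heart attack risk rises with blood pressure", {"cardiovascular"}),
    ("virus infection spreads through blood", {"infection"}),
    ("heart infection caused by virus", {"cardiovascular", "infection"}),
]
CATEGORIES = ["cardiovascular", "infection"]  # the predefined category set C


def build_vocabulary(docs):
    """Return the feature set W as a sorted list of distinct words."""
    return sorted({word for text, _ in docs for word in text.split()})


def vectorize(text, vocabulary):
    """Represent a document d_i as the vector (w_i1, ..., w_i|W|).

    Assumption: w_ik is the raw count of word w_k in the document.
    """
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


def target(labels, category):
    """Target function f(d_i, c_j): 1 for a positive example, -1 otherwise."""
    return 1 if category in labels else -1


VOCAB = build_vocabulary(DOCS)
VECTORS = [vectorize(text, VOCAB) for text, _ in DOCS]
TARGETS = [[target(labels, c) for c in CATEGORIES] for _, labels in DOCS]

if __name__ == "__main__":
    for vector, row in zip(VECTORS, TARGETS):
        print(vector, row)
```

A classifier learned from data would approximate the values in TARGETS using only the vectors in VECTORS; here the targets are simply read off the known labels to make the definition concrete.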
The goal of this paper is to categorize electronic biomedical texts into one or more categories automatically [39]. The following part describes the methods used in different aspects of TC. The Naive Bayes (NB) model has been one of the more popular methods used in TC due to its simplicity and relative effectiveness [7, 27, 30]. However, the performance of the NB model has turned out to be inferior to that of other models such as Support Vector Machine (SVM) [19], k-Nearest Neighbor (KNN) [43], and Neural Network (NN) [44]. The outcome of many studies confirms that there is no single best TC model. Instead, distinct models seem to be robust for different aspects of TC and within different contexts: KNN-based models are easily scalable to large data sets [43], NN-based models are best suited to applications with obscure intrinsic structures [37], NB-based models are appropriate for their simplicity and extensibility to web documents with links [26], and SVM-based models may be used for their resistance to over-fitting and to large dimensionality [14]. The Hidden Markov Model (HMM) has been used to describe sequential random processes [41, 2]. Association Rule Mining (ARM) examines the contents of a database and finds rules [7].

Another significant consideration for this study is that surveys of biomedical text mining [50, 49], a journal [8], and a book [3] indicate that general-purpose text and data mining tools are not well suited to the biomedical domain. The biomedical domain is highly specialized, but biomedical information is being created in text forms [40]. In this paper, a hybrid association rule and hidden Markov model (AR-HMM) is investigated. To demonstrate the effectiveness of the proposed method, it is compared with SVM and NB.

The rest of this paper is organized as follows. Section 2 describes the methods and materials used in this study and presents the performance measurements used to evaluate the categorization models.
Section 3 presents the results and discussion, then reviews the most related work on the hidden Markov model and association rules. In Section 4,