S. Chaudhury et al. (Eds.): PReMI 2009, LNCS 5909, pp. 273–278, 2009. © Springer-Verlag Berlin Heidelberg 2009

Automatic Keyphrase Extraction from Medical Documents

Kamal Sarkar

Computer Science & Engineering Department, Jadavpur University, Kolkata – 700 032, India
jukamal2001@yahoo.com

Abstract. Keyphrases provide semantic metadata that summarizes a document and enables the reader to quickly determine whether a given article falls within the reader’s fields of interest. This paper presents an automatic keyphrase extraction method based on naive Bayesian learning that exploits a number of domain-specific features to boost keyphrase extraction performance in the medical domain. The proposed method has been compared to a popular keyphrase extraction algorithm called Kea.

Keywords: Domain specific keyphrase extraction, Medical documents, Text mining, Naïve Bayes.

1 Introduction

Medical literature such as research articles, clinical trial reports, and medical news available on the web is an important source of help for clinicians in patient care. The spread of a huge amount of medical information through the WWW has created a growing need for techniques for discovering, accessing, and sharing knowledge from medical literature. Keyphrases help readers rapidly understand, organize, access, and share the information in a document. Document keyphrases provide a concise summary of the document content. Medical research articles published in journals generally come with several author-assigned keyphrases. However, medical articles such as medical news, case reports, and medical commentaries may not have author-assigned keyphrases. Sometimes, the number of author-assigned keyphrases available with an article is too small to represent its topical content. An automatic keyphrase extraction process is therefore highly desirable.
A number of previous works have suggested that document keyphrases can be useful in various applications such as retrieval engines [1], [2], [3], browsing interfaces [4], thesaurus construction [5], and document classification and clustering [6]. Turney [7] treats the problem of keyphrase extraction as a supervised learning task. Turney’s program is called Extractor. One form of this extractor, called GenEx, is designed around a set of parameterized heuristic rules that are fine-tuned using a genetic algorithm. A keyphrase extraction program called Kea, developed by Frank et al. [8], uses Bayesian learning for the keyphrase extraction task. In both Kea and Extractor, the candidate