Mining the Risk Types of Human Papillomavirus (HPV) by AdaCost

S.-B. Park, S. Hwang, and B.-T. Zhang
School of Computer Science and Engineering
Seoul National University
151-742 Seoul, Korea
{sbpark,shhwang,btzhang}@bi.snu.ac.kr

Abstract. Human Papillomavirus (HPV) infection is known as the main factor for cervical cancer, which is a leading cause of cancer deaths in women worldwide. Because there are more than 100 types of HPV, it is critical to discriminate the HPVs related to cervical cancer from those that are not. In this paper, we classify the risk type of HPVs using their textual explanations. The important issue in this problem is that false negatives and false positives are not equally costly. That is, we must find the high-risk HPVs, even though we may misclassify some low-risk HPVs in doing so. For this purpose, AdaCost, a cost-sensitive learner, is adopted to take into account the different costs of training examples. The experimental results on the HPV sequence database show that considering costs gives higher performance. The F-score is higher than the accuracy, which implies that most high-risk HPVs are found.

1 Introduction

Human papillomavirus (HPV) is a double-stranded DNA tumor virus that belongs to the papovavirus family, and there are more than 100 types of HPV that are specific for epithelial cells including skin, respiratory mucosa, and the genital tract. In particular, the genital tract HPV types are classified by their relative malignant potential into low- and high-risk types [6]. The common, unifying oncogenic feature of the vast majority of cervical cancers is the presence of high-risk HPV. Therefore, the most important thing for diagnosis and therapy is determining which HPV types are high-risk.

One way to discriminate the risk types of HPVs is to use a text mining technique. Since a great number of research results on HPV have already been reported in biomedical journals [4,5], they can be used as a source for discriminating HPV risk types.
One problem in discriminating the risk types is that false negatives must be distinguished from false positives, because the two kinds of error have very different consequences. It is not critical to classify a low-risk HPV as high-risk, because the error can be corrected by further empirical study. However, it is fatal to classify a high-risk HPV as low-risk. In this case, a dangerous HPV is missed, and there is no further chance to detect the cervical cancer caused by it.

V. Mařík et al. (Eds.): DEXA 2003, LNCS 2736, pp. 403–412, 2003. © Springer-Verlag Berlin Heidelberg 2003
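The asymmetry between false negatives and false positives described in the introduction can be illustrated with a small sketch. It is not the AdaCost learner itself and uses no data from the paper: the labels "high" and "low", the two hypothetical prediction lists, and the cost values (1 for a false positive, 5 for a false negative) are all assumptions chosen only to show how a cost-weighted error can rank two classifiers differently from plain accuracy.

```python
# Illustrative sketch only: toy labels and hypothetical costs, not the paper's data.

def confusion_counts(y_true, y_pred, positive="high"):
    """Return (tp, fp, fn, tn), treating `positive` as the high-risk class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # low-risk type flagged as high-risk (cheap mistake)
        elif truth == positive:
            fn += 1  # high-risk type missed (expensive mistake)
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f_score(tp, fp, fn):
    """Balanced F-score (harmonic mean of precision and recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def total_cost(fp, fn, cost_fp=1.0, cost_fn=5.0):
    """Cost-weighted error; the default costs are assumed, not from the paper."""
    return cost_fp * fp + cost_fn * fn

y_true   = ["high", "high", "high", "low", "low", "low"]
cautious = ["high", "high", "high", "high", "high", "low"]  # misses no high-risk type
risky    = ["high", "high", "low", "low", "low", "low"]     # misses one high-risk type

for name, y_pred in [("cautious", cautious), ("risky", risky)]:
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    print(name,
          "accuracy=%.3f" % accuracy(tp, fp, fn, tn),
          "F=%.3f" % f_score(tp, fp, fn),
          "cost=%.1f" % total_cost(fp, fn))
```

Under these assumed costs, the "risky" predictions have the higher accuracy but also the higher total cost, because their single error is a missed high-risk type. This is the kind of trade-off a cost-sensitive learner is meant to account for.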