Protein-Protein Interaction Extraction: A Supervised Learning Approach

Juan Xiao 1,2, Jian Su 1, GuoDong Zhou 1 and ChewLim Tan 2
1 Institute for Infocomm Research, Singapore
2 School of Computing, National University of Singapore
{stuxj, sujian, zhougd}@i2r.a-star.edu.sg, tancl@comp.nus.edu.sg

Abstract

In this paper, we propose using Maximum Entropy to extract protein-protein interaction (PPI) information from the literature, which overcomes the limitations of the state-of-the-art co-occurrence-based and rule-based approaches. The model incorporates corpus statistics of various lexical, syntactic and semantic features. We find that shallow lexical features contribute a large portion of the performance improvement, in contrast to parsing or partial parsing information, yet such lexical features have never been used before in other PPI extraction systems. The new approach achieves a very encouraging result of 93.9% recall and 88.0% precision on the IEPA corpus. To the best of our knowledge, this is not only the first systematic study of supervised learning and the first attempt at feature-based supervised learning for PPI extraction; it also provides useful features, such as surrounding words, key words and abbreviations, that extend the supervised learning capability for relation extraction to other domains such as news.

1. Introduction

Protein-protein interaction (PPI) extraction is becoming critical in the field of molecular biology due to the demand for automatic discovery of molecular pathways and interactions in the literature. The goal of PPI extraction is to recognize various interactions, such as transcription, translation, post-translational modification, complexing and dissociation, between proteins, drugs, or other molecules in the biomedical literature. Because the large MedLine abstract collection is publicly available, most current work has been done on MedLine abstracts.
Existing PPI work can be roughly divided into two categories: co-occurrence-based approaches (Stapley and Benoit, 2000; Shatkay et al., 2000) and rule-based approaches (Ono et al., 2001; Koike et al., 2003; Thomas et al., 2000; Friedman et al., 2001; Daraselia et al.). Co-occurrence-based approaches simply use the co-occurrence statistics of two proteins to predict their relation. Consequently, they can extract well-known PPIs but may not be able to find newly emerging ones. Rule-based approaches, on the other hand, rely on pre-defined phrase pattern rules. As a result, they cannot discover new phrase patterns that lack the known keywords. Once the rule set reaches a certain size, it becomes very difficult to insert additional rules for further performance improvement. Moreover, rule-based approaches may require the whole set of pattern rules to be redefined when they are applied to a new domain.

Protein-protein interaction extraction is a relation extraction task, and some work on relation extraction in the news domain has also been reported. Zelenko et al. (2003) use a kernel-based classification approach that extracts relations by computing kernel functions between parse trees. Culotta and Sorensen (2004) follow a similar approach and extend it to estimate kernel functions between augmented dependency trees. Because of their computational complexity, speed remains a serious obstacle to using kernel approaches in practical applications. Nanda (2004) proposed using a Maximum Entropy model to integrate lexical, syntactic and semantic features for the relation detection and characterization (RDC) task, which covers 24 relation types on news articles, in Automatic Content Extraction (ACE¹, 2004), an evaluation conducted by NIST to measure information extraction technologies. That model performs better than the approach of Culotta and Sorensen (2004) on the ACE corpus.
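To make the feature-based Maximum Entropy idea concrete, the following is a minimal sketch of a conditional maximum entropy classifier over binary features, where p(y | x) is proportional to exp of the summed weights of the (feature, label) pairs active for x. It is an illustration under stated assumptions, not the implementation used in any of the cited systems: the function names `train_maxent` and `predict_proba`, the plain gradient-ascent training loop, the feature strings and the toy examples are all hypothetical.

```python
import math
from collections import defaultdict


def predict_proba(weights, classes, feats):
    """Return p(label | feats) under the log-linear (maximum entropy) model."""
    # Score of each label = sum of weights of its active (feature, label) pairs.
    scores = {c: sum(weights.get((f, c), 0.0) for f in feats) for c in classes}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())      # normalisation constant Z(x)
    return {c: e / z for c, e in exps.items()}


def train_maxent(examples, labels, epochs=200, lr=0.5):
    """Fit weights by batch gradient ascent on the conditional log-likelihood.

    examples: list of lists of active binary feature names (hypothetical format)
    labels:   list of gold labels, one per example
    """
    classes = sorted(set(labels))
    weights = defaultdict(float)  # (feature, label) -> weight
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in zip(examples, labels):
            probs = predict_proba(weights, classes, feats)
            for f in feats:
                for c in classes:
                    # Gradient = observed feature count - expected feature count.
                    grad[(f, c)] += (1.0 if c == gold else 0.0) - probs[c]
        for key, g in grad.items():
            weights[key] += lr * g / len(examples)
    return weights, classes


# Toy, made-up training data: each example lists the features that fire.
train_x = [
    ["between=binds", "keyword=bind"],
    ["between=activates", "keyword=activate"],
    ["between=and"],
    ["between=or"],
]
train_y = ["interaction", "interaction", "none", "none"]

weights, classes = train_maxent(train_x, train_y)
probs = predict_proba(weights, classes, ["between=binds", "keyword=bind"])
```

Because every feature is just a named indicator, lexical, syntactic and semantic evidence can be mixed freely in one model. Practical systems would normally add regularisation and use a dedicated optimiser rather than this plain gradient loop.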
Inspired by Nanda's work, we propose in this paper to use Maximum Entropy models to combine diverse lexical, syntactic and semantic features for PPI extraction. We want to determine how well this approach performs for PPI extraction, which kinds of features are needed, and how much each contributes to the overall performance. Our implemented system shows very encouraging performance, with 93.9% recall and 88.0% precision on the IEPA corpus. We also find that some features that were not used in Nanda's work, including the surrounding words feature, the keyword feature and mention pairs, are very useful for PPI extraction.

¹ http://www.nist.gov/speech/tests/ace
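As an illustration of the shallow lexical features named in the introduction (surrounding words, keywords and mention pairs), the sketch below turns one tokenized sentence and two protein mention positions into binary feature names. This section does not define the features precisely, so everything here is an assumed reading: the function `extract_features`, the window size, the feature name prefixes, the keyword list and the placeholder protein names `PROT_A` and `PROT_B` are all hypothetical.

```python
def extract_features(tokens, i, j, keywords, window=2):
    """Return binary feature names for the mention pair at token indices i < j.

    tokens:   list of word tokens for one sentence
    keywords: set of lower-cased interaction keywords (assumed, not from the paper)
    window:   how many tokens to take on each outer side of the pair
    """
    feats = []
    # Surrounding words: tokens just before the first mention ...
    for k in range(max(0, i - window), i):
        feats.append("left=" + tokens[k].lower())
    # ... tokens just after the second mention ...
    for k in range(j + 1, min(len(tokens), j + 1 + window)):
        feats.append("right=" + tokens[k].lower())
    # ... and every token between the two mentions.
    for tok in tokens[i + 1:j]:
        feats.append("between=" + tok.lower())
    # Keyword feature: any listed interaction keyword found in the sentence.
    for tok in tokens:
        if tok.lower() in keywords:
            feats.append("keyword=" + tok.lower())
    # Mention pair feature: the two mention strings taken together.
    feats.append("pair=" + tokens[i] + "|" + tokens[j])
    return feats


# Made-up example sentence with placeholder protein names.
tokens = "We show that PROT_A interacts with PROT_B in vivo".split()
features = extract_features(tokens, 3, 6, keywords={"interacts", "binds"})
```

Each returned string acts as one indicator feature, so the output can be fed directly to a feature-based classifier such as a Maximum Entropy model.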