INFORMATION EXTRACTION FROM BIOMEDICAL TEXT: THE BIOTEXT PROJECT

Filip Ginter, Tapio Pahikkala, Sampo Pyysalo, Evgeni Tsivtsivadze, Jorma Boberg, Jouni Järvinen, Aleksandr Mylläri and Tapio Salakoski
Turku Centre for Computer Science (TUCS) and Dept. of IT, University of Turku

Abstract

We study information extraction for identifying protein-protein interactions stated in biomedical text. In this paper, we present an architecture for an information extraction system and discuss our improvements and results pertaining to several components of the system, including information retrieval, named entity recognition, syntactic analysis, and domain analysis. The individual results are discussed in the context of the whole system, and domain adaptations and differences from classical approaches are considered. We combine structural natural language processing with machine learning methods to address the general and domain-specific challenges of information extraction targeting protein-protein interactions.

Keywords: biomedical literature mining, information retrieval, named entity recognition, word sense disambiguation, parsing, parse ranking

1. Introduction

The amount of published knowledge in the biomedical domain is overwhelming and grows at an unprecedented rate. Although many databases collecting biomedical knowledge exist, their coverage is limited and manual identification of e.g. protein-protein interactions requires significant human effort. Freeform text remains a main source of information and thus Natural Language Processing (NLP) and Information Extraction (IE) methods are required to facilitate automated processing and structured access to the knowledge. The BioText project aims at developing NLP methods and resources for biomedical text mining as well as adapting existing methods to take into account the specific properties of the biomedical text domain. This paper gives an overview of our approach, the developed methods and the key results of the project.
Our overall goal is to develop a modular system that processes biomedical text, such as abstracts contained in the PubMed literature database, and extracts the protein-protein interactions stated therein. The system consists of four major subsystems: Information Retrieval (IR), Named Entity (NE) recognition, syntactic analysis, and pattern-based domain analysis. We apply machine learning approaches such as Bayesian classification, Support Vector Machines (SVM) (see e.g. Vapnik 1998) and Regularized Least-Squares (RLS) (see e.g. Poggio and Smale 2003) as well as