Extracting Gene Function Descriptions by Probability-based Sentence Selection

Kazuhiro Seki, Nihar Sheth, and Javed Mostafa
Laboratory for Applied Informatics Research, Indiana University
1320 East Tenth Street, LI 011, Bloomington, Indiana 47405-3907, USA
{kseki,nisheth,jm}@indiana.edu

Abstract

This paper presents an approach to the secondary task of the TREC Genomics Track. We regard the task as identification of the sentences which describe gene functions (i.e., GeneRIFs), and propose a method from two perspectives: automatic summarization, assuming that the given articles mainly report gene functions, and question answering (QA), assuming a question asking about gene functions. From the point of view of automatic summarization, we use word frequencies in a given article and location information to identify topic sentences. In terms of QA, we use word frequencies in GeneRIFs to find a set of words frequently used in describing gene functions. We formalize a probabilistic model combining these multiple information sources. Our method is evaluated on the test set of 139 MEDLINE abstracts, and the results demonstrate the following: the distributions of function words in the input provide some clues for identifying gene function descriptions; there is a vocabulary peculiar to GeneRIFs; and location information gives the highest predictive power for this specific task. Additionally, we examine some alternative methods and their effectiveness in comparison with our method.

1 Introduction

The volume of publications in the biological domain has been growing rapidly, making it difficult for individual researchers to keep themselves updated. This has resulted in a strong demand for information retrieval (IR) and information extraction (IE) techniques which could help us manage the information overload.

To foster IR and IE research in the area of biology, the Genomics Track will be held at the Text REtrieval Conference (TREC) 2003 for the first time (Hersh, 2002).
TREC is one of the major conferences targeting IR and has been contributing to the development of current IR research and several related areas, e.g., question answering (QA) and filtering, since it first started in 1992. The Genomics Track aims at IR and IE, reflecting the increasing interest in the practical applications of those techniques to the biological literature.

This year, the Genomics Track offers two independent tasks for IR and IE, called the primary and secondary tasks, respectively. In short, the primary task asks the participants to find MEDLINE articles stating the functions associated with given gene names, while the secondary task aims at automatically generating concise descriptions of gene functions stated in given research articles. Our research group targeted the secondary task.

The rest of this paper is organized as follows. Section 2 overviews the secondary task. Section 3 summarizes the past research related to the task. Section 4 describes our proposed method for identifying gene function descriptions. Section 5 reports a series of experiments carried out to evaluate our method. Section 6 compares our method with alternative approaches. Lastly, Section 7 concludes this paper with a brief summary and possible directions for future research.

2 Overview – the Secondary Task

The secondary task targets information extraction (IE) from the biological literature. Specifically, it aims at generating descriptions related to gene functions in an automated way. For this year, the Track Steering Committee decided to experimentally make use of GeneRIF (Gene References into Function) entries as the gold standard. These entries are included in the LocusLink database (Pruitt and Maglott, 2001), which is maintained by the National Center for Biotechnology Information (NCBI).
GeneRIFs are functional annotations of genes and, according to the NCBI web page,¹ are defined as “a concise phrase describing a function or functions (less than 255 characters in length, preferably more than a restatement of the title of the paper).” They have been mainly annotated by experts in the life sciences at the National Library of Medicine (NLM).

¹ http://www.ncbi.nlm.nih.gov/LocusLink/GeneRIFhelp.html