Summarizing Biological Literature with BioSumm

Elena Baralis
Dipartimento di Automatica e Informatica
Politecnico di Torino, Torino, Italy
elena.baralis@polito.it

Alessandro Fiori
Dipartimento di Automatica e Informatica
Politecnico di Torino, Torino, Italy
alessandro.fiori@polito.it

ABSTRACT

BioSumm is a summarization environment that supports user queries on online repositories of scientific publications by providing abstract descriptions of focused document groups. The summarization approach is driven by a grading function which evaluates the occurrences of domain dictionary terms. The demonstrated system enables users to query and download research papers from online databases (e.g., PubMed) and local repositories. The (possibly large) retrieved document collection is then partitioned into document clusters devoted to homogeneous topics. Finally, documents in a cluster are summarized by extracting sentences relevant for a specific application domain. In the demo the considered domain is the interaction of human genes and proteins.

Categories and Subject Descriptors

I.5.4 [Pattern Recognition]: Applications—Text processing; J.3 [Life and Medical Sciences]: Biology and genetics

General Terms

Algorithms

Keywords

Domain information, Text summarization, User Interface

1. INTRODUCTION

In the bioinformatics domain, finding the relationships between genes or proteins related to specific topics (e.g., tumor diseases) is an interesting and challenging task. Researchers studying the interactions between genes and proteins related to a disease or a biological process usually explore the published literature manually to find related works that discuss previously published results.

Previous approaches to biological information search (e.g., iHOP [6], FACTA [11]) typically perform a keyword search in PubMed abstracts to detect the sentences containing the target keywords (e.g., a gene, or a protein).
These approaches yield rather heterogeneous search results, because the actual topic (i.e., the context) of the analyzed documents is not considered. Thus, they require the user to further filter a poorly focused, and possibly large, search result.

Copyright is held by the author/owner(s). CIKM’10, October 25–30, 2010, Toronto, Ontario, Canada. ACM 978-1-4503-0099-5/10/10.

BioSumm is a summarization environment that allows users to query online repositories (e.g., PubMed) and extract small subsets of documents relevant to the search. BioSumm enhances keyword document search by (a) grouping a (possibly large) set of retrieved documents in focused clusters, and (b) providing a multidocument summarization of each cluster. Browsing the small number of generated summaries allows the user to select the subset(s) of documents of interest.

BioSumm improves over traditional, general-purpose, automatic text summarization approaches by generating ad-hoc document summaries oriented to a specific domain. Experimental results obtained with the automatic ROUGE evaluator [7] show the potential of our approach [1].

In this demo, the focus will be on gene and protein information. Summary generation exploits the knowledge provided by specific domain terms (e.g., human genes, or protein names). It is driven by a novel grading function, which biases sentence selection by means of an appropriate domain-specific dictionary. A description of the grading function and its application in the biological domain can be found in [5]. By only modifying the entries of the domain dictionary, our approach can be exploited in different application domains.

2. SYSTEM ARCHITECTURE

The BioSumm architecture is shown in Figure 1. It is fully modular and allows the user to integrate plugins that address specific tasks (e.g., clustering, web search, text summarization).
Furthermore, by selecting the appropriate domain dictionary, the grading function may be effectively tailored to the application domain of interest. The main components of the framework are described below.

Online search & local repository. Given a keyword query and a target search engine (or publication repository), this module executes the query on the selected engine and returns the set of retrieved documents. The demonstrated system integrates plugins that translate the user keyword search for Google Scholar [2], PubMed Central (PMC) [4], and PubMed [3]. Alternatively, the system allows the user to select locally stored documents in PDF and XML formats, possibly produced by previous search sessions.

Document structure extractor. The documents returned by a search session are parsed to extract the available components (e.g., title, authors, journal, abstract, body, keywords). The documents are then stored locally in a common XML representation.

Clustering. To reduce the heterogeneity of the retrieved documents, a clustering step can optionally be performed.
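
As a purely illustrative aid, the common XML representation produced by the document structure extractor might be built along the following lines. This is a minimal sketch, not the BioSumm implementation: the function name to_common_xml, the element names (document, author, keyword, and so on), and the dictionary-based input are all assumptions made for the example, since the paper does not specify the schema. Only the component list (title, authors, journal, abstract, body, keywords) comes from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: the element names below are assumptions,
# not the format actually used by BioSumm.
SINGLE_FIELDS = ("title", "journal", "abstract", "body")
LIST_FIELDS = (("authors", "author"), ("keywords", "keyword"))


def to_common_xml(doc: dict) -> ET.Element:
    """Build a <document> element from the components extracted from a paper.

    Components that are missing or empty are simply skipped, because not
    every retrieved document exposes every component.
    """
    root = ET.Element("document")
    for field in SINGLE_FIELDS:
        value = doc.get(field)
        if value:
            ET.SubElement(root, field).text = value
    for container_tag, item_tag in LIST_FIELDS:
        values = doc.get(container_tag) or []
        if values:
            container = ET.SubElement(root, container_tag)
            for value in values:
                ET.SubElement(container, item_tag).text = value
    return root


# Example input: components of this paper itself (no journal available).
example = {
    "title": "Summarizing Biological Literature with BioSumm",
    "authors": ["Elena Baralis", "Alessandro Fiori"],
    "abstract": "BioSumm is a summarization environment ...",
    "keywords": ["Domain information", "Text summarization", "User Interface"],
}

root = to_common_xml(example)
xml_text = ET.tostring(root, encoding="unicode")
```

Storing every document in one uniform structure, whatever its source, means that later modules (clustering, summarization) only need to read a single format; xml_text can be written to a local file for reuse in later sessions.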