Evaluating Different Ranking Functions for Context-Based Literature Search

Nattakarn Ratprasartporn, Sulieman Bani-Ahmad, Ali Cakmak, Jonathan Po, Gultekin Ozsoyoglu
Department of Electrical Engineering and Computer Science
Case Western Reserve University, Cleveland, Ohio 44106
{nattakarn, sulieman, cakmak, jlp25, tekin}@case.edu

Abstract

Context-based literature digital library search is a new search paradigm that creates an effective ranking of query outputs by controlling query output topic diversity. We define contexts as pre-specified ontology-based terms and locate the paper set of a context based on semantic properties of the context (ontology) term. In order to provide a comparative assessment of papers in a context and effectively rank papers returned as search outputs, prestige scores are attached to all papers with respect to their assigned contexts. In this paper, we present three different prestige score (ranking) functions for the context-based environment, namely, citation-based, text-based, and pattern-based score functions. Using biomedical publications as the test case and Gene Ontology as the context hierarchy, we have evaluated the proposed ranking functions in terms of their accuracy and separability. We have found that text-based and pattern-based score functions yield better accuracy and separability than citation-based score functions.

1. Introduction

At the present time, search queries in literature digital libraries either lack or do not provide effective paper-scoring/ranking functions. We argue that the main reason for the ineffectiveness of ranking functions is that they do not take into account the diversity of papers returned as the output of keyword-based search queries. Without an effective scoring and ranking system, users are forced to scan a large paper set and potentially miss important papers. As an example, PubMed [1], which contains more than 14 million publications, does not have a paper-scoring/ranking system.
Instead, PubMed simply lists search results in descending order of their PubMed ids or publication years. Other well-known digital libraries, such as ACM Portal [26] and Google Scholar [27], use only simple text-based and/or citation-based scores to rank search results.

In an earlier work [2], we proposed a new literature digital library search paradigm, context-based search, which controls the diversity of search output topics and effectively ranks query output publications. Before query submission, two query-independent pre-processing steps are performed: (i) assign publications to pre-specified ontology-based contexts, and (ii) compute prestige (importance/ranking) scores for papers with respect to their assigned contexts. Thus, in a given context, a paper with a high prestige score is highly relevant to the context. Then, at search time, (a) only those papers in contexts of interest are involved in the search, and (b) search results in each context are ranked by their relevancy scores. A paper's relevancy score in a context is a combination of the paper's pre-computed prestige score (based on the context) and the paper-to-query matching score. In contrast to other search paradigms, context-based search output is enhanced by a context-based paper classification, which eliminates the problem of topic diffusion and reduces output size [2]. Since only semantically related papers in contexts of interest (as opposed to all papers) are involved in the search, search output ranking is more consistent and accurate.

In [2], we tested our search paradigm using PubMed [1] papers as a testbed and Gene Ontology (GO) [3] as a context hierarchy. When compared with PubMed keyword-based search engine query results, the context-based search approach was shown experimentally [2] to reduce the query output size by up to 70% and increase the search result accuracy by up to 50%.
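The relevancy-score combination described above can be sketched as follows. This is a minimal illustration only: the function and parameter names, the linear weighting, and the score normalization are assumptions for exposition, not the paper's exact formulation.

```python
# Hypothetical sketch: rank the papers of one context by a relevancy score
# that combines (a) the paper's pre-computed, context-specific prestige score
# with (b) its paper-to-query matching score. The weight alpha and the
# linear combination are illustrative assumptions.

def relevancy_score(prestige, query_match, alpha=0.5):
    """Combine a context prestige score and a query-matching score
    (both assumed normalized to [0, 1])."""
    return alpha * prestige + (1 - alpha) * query_match

def rank_context(papers, query_scores, prestige_scores, alpha=0.5):
    """Return (paper_id, relevancy) pairs for one context, best first."""
    scored = [
        (pid, relevancy_score(prestige_scores[pid], query_scores[pid], alpha))
        for pid in papers
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy example: three papers in a single context of interest.
papers = ["p1", "p2", "p3"]
prestige = {"p1": 0.9, "p2": 0.4, "p3": 0.7}   # pre-computed offline
match = {"p1": 0.2, "p2": 0.8, "p3": 0.6}      # computed at query time
print(rank_context(papers, match, prestige))
```

Note that the prestige scores are query-independent and can be computed once per context during pre-processing; only the query-matching scores are computed at search time.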
As described above, implementing the context-based search involves five tasks: (1) assign papers to contexts, (2) compute prestige scores for papers in each context, (3) locate search contexts for a given keyword-based query, (4) perform the search, and (5) rank search results. Tasks 1, 3, 4, and 5 have been extensively studied in our previous work [2]. This paper investigates task 2 as follows:

• We present three different context-based prestige score functions, namely, citation-based, text-based, and pattern-based score functions. As mentioned above, to rank search results within a given context, we use (a) prestige scores of papers in the context, and (b) similarity scores between the search query and the papers. The citation-based function employs the well-known PageRank algorithm [8-10], which recursively determines the prestige of a paper using citations to the paper and the scores of papers citing the paper. While the citation-based score function uses only citation information, the text-based score function utilizes a paper's content, authors, and citations as follows. First, a paper that best characterizes the context is selected as a representative paper of the context. Then, the text-based prestige score of a paper p in context c_i is computed from (a) text-based content similarity, (b) author overlap, and (c) citation similarity between