Taking Chemistry to the Task – Personalized Queries for Chemical Digital Libraries

Sascha Tönnies
L3S Research Center
Appelstrasse 9a
30167 Hannover, Germany
toennies@l3s.de

Benjamin Köhncke
L3S Research Center
Appelstrasse 9a
30167 Hannover, Germany
koehncke@l3s.de

Wolf-Tilo Balke
IFIS TU Braunschweig
Mühlenpfordtstrasse 23
38106 Braunschweig, Germany
balke@ifis.cs.tu-bs.de

ABSTRACT
Nowadays, information access is conducted almost exclusively via the Web. Simple keyword-based Web search engines, e.g., Google or Yahoo!, offer suitable retrieval and ranking features. For highly specialized domains represented by digital libraries, in contrast, these features are insufficient. Consider the domain of chemistry, where searching for relevant literature is essentially centered on chemical entities. Besides commercial information providers such as the Chemical Abstracts Service (CAS), numerous groups are working on building free chemical search engines to overcome the expensive access to chemical literature. However, chemical queries are by their nature often overspecialized, so meaningful similarity measures for chemical entities are needed for query relaxation. In chemistry, the range of similarity measures is vast: more than 40 measures are available, each focusing on different aspects of chemical entities. This large number is not surprising, because the desired search results depend strongly on the working field of the chemist. In this paper we present a personalized retrieval system for chemical documents that takes into account the background knowledge of the individual chemist. This is done through query relaxation for chemical entities using similar substances. We evaluate our approach extensively by analyzing the correlation of commonly used chemical similarity measures and fingerprint representations. All uncorrelated measures are finally used by our feedback engine to learn the preferred similarity measures for each user.
We also conducted a user study with domain experts showing that our system can assign a unique similarity measure for 75% of the users after only 10 feedback cycles.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Storage and Retrieval – Information Search and Retrieval.

General Terms
Measurement, Experimentation, Human Factors.

Keywords
Chemical Digital Libraries, Personalization, Query Relaxation

1. INTRODUCTION
Today, a keyword-based Web search is the starting point for almost all information gathering processes. However, in some highly specialized domains a simple keyword-based search is not sufficient. For example, the information gathering process in chemistry is entity centered. As a major information provider in the domain of chemistry, CAS, a subsidiary of the American Chemical Society (ACS), offers a specialized digital library indexing a variety of chemical document collections. Since digital libraries promise high-quality information access, the ACS maintains its entity database, the CAS Registry, by manually indexing all chemical entities occurring in the chemical literature. Further, it annotates the documents in order to build its CAS search index for chemical literature, resulting in a high-quality digital library. This quality, however, is gained at the expense of prohibitively high costs for the manual indexing process. Moreover, search engine access is very expensive and strictly restricted to subscribers. In an attempt to overcome the costly access to chemical literature, several groups are currently working on building free chemical search engines. Prime examples are the substance database PubChem¹, which combines several chemical entity data sources, and the document search engine ChemXSeer². ChemXSeer relies on a highly complex process that automatically extracts chemical formulas from 150,000 RSC publications and links them to the documents [1, 2].
Numerous publishers are also improving their information gathering process by adding chemical annotations to their documents. A prime example is RSC Publishing³, which uses the Oscar3 framework [3] to identify chemical entity names in the full text of documents. These names are transformed into structural information, stored in a structure database, and linked to the document. However, these approaches still need special databases to handle chemical information. In our previous work [4], we showed how structural data can be used to build index pages for chemical documents. These index pages are indexed by Google and linked to the original documents. Since synonyms and different entity

¹ http://pubchem.ncbi.nlm.nih.gov/
² http://chemxseer.ist.psu.edu:8080/chemxseer
³ http://pubs.rsc.org/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
JCDL’11, June 13–17, 2011, Ottawa, Ontario, Canada.
Copyright 2011 ACM 978-1-4503-0744-4/11/06...$10.00.